
Dynamic sub-route-based self-adaptive beam search Q-learning algorithm for traveling salesman problem

  • Jin Zhang ,

    Roles Conceptualization, Investigation, Methodology, Supervision, Writing – review & editing

    zhangjin@henu.edu.cn

    Affiliations School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China, Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, Henan, China

  • Qing Liu,

    Roles Software, Writing – original draft

    Affiliation School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China

  • XiaoHang Han

    Roles Data curation, Validation

    Affiliation School of Computer and Information Engineering, Henan University, Kaifeng, Henan, China

Abstract

In this paper, a dynamic sub-route-based self-adaptive beam search Q-learning (DSRABSQL) algorithm is proposed that provides a reinforcement learning (RL) framework combined with local search to solve the traveling salesman problem (TSP). DSRABSQL builds upon the Q-learning (QL) algorithm. Considering its problems of slow convergence and low accuracy, four strategies within the QL framework are designed first: the weighting function-based reward matrix, the power function-based initial Q-table, a self-adaptive ε-beam search strategy, and a new Q-value update formula. Then, a self-adaptive beam search Q-learning (ABSQL) algorithm is designed. To solve the problem that the sub-route is not fully optimized in the ABSQL algorithm, a dynamic sub-route optimization strategy is introduced outside the QL framework, and then the DSRABSQL algorithm is designed. Experiments are conducted to compare QL, ABSQL, DSRABSQL, our previously proposed variable neighborhood discrete whale optimization algorithm, and two advanced reinforcement learning algorithms. The experimental results show that DSRABSQL significantly outperforms the other algorithms. In addition, two groups of algorithms are designed based on the QL and DSRABSQL algorithms to test the effectiveness of the five strategies. From the experimental results, it can be found that the dynamic sub-route optimization strategy and self-adaptive ε-beam search strategy contribute the most for small-, medium-, and large-scale instances. At the same time, collaboration exists between the four strategies within the QL framework, which increases with the expansion of the instance scale.

1. Introduction

For a set of given cities, the traveling salesman problem (TSP) is finding the shortest route along which a salesman visits all of the cities exactly once before returning to the starting point. The TSP is a well-known combinatorial optimization problem with applications in many fields [1], such as transportation, circuit board design, production scheduling, and logistics distribution.

Since the TSP is a classic NP-hard problem, numerous approaches have been proposed to solve it, most of which are exact or heuristic algorithms. Exact algorithms include branch-and-bound (BnB), cutting-plane, integer programming, and dynamic programming, all of which obtain the global optimal solution by continuous iteration. For example, Pekny et al. (1990) [2] and Pesant et al. (1998) [3] proposed the BnB method and its variants for the TSP, and the famous TSP solver Concorde (http://www.math.uwaterloo.ca/tsp/concorde/index.html) is based on the BnB algorithm. Sanches et al. (2017) [4] proposed a partitioned cross-improvement initial solution method to speed up Concorde. However, since the time cost of exact algorithms increases exponentially with the size of the instance, they are not suitable for large-scale applications.

Heuristic algorithms are widely used because of their high computational efficiency: they can obtain a sub-optimal solution in a reasonable timeframe. Representative heuristic algorithms include the Lin-Kernighan heuristic (LKH), ant colony optimization (ACO), genetic algorithm (GA), particle swarm optimization (PSO), whale optimization algorithm (WOA), and gray wolf optimization (GWO). Building on these, various improved algorithms have been studied for solving the TSP. For example, by modifying the heuristic rules of LKH to improve its search strategy, Helsgaun (2000) [5] proposed an improved LKH algorithm. Ebadinezhad et al. (2020) [6] proposed an ACO algorithm that includes a dynamic evaporation strategy (DEACO). Wang et al. (2022) [7] proposed a fine-grained fast parallel GA based on a ternary optical computer, and Zheng et al. (2022) [8] proposed a transfer learning-based PSO algorithm. Zhang et al. (2020) [9] proposed a variable neighborhood discrete WOA (VDWOA), while Panwar et al. (2021) [10] proposed a novel discrete GWO algorithm. However, heuristic algorithms are mostly based on random search, which lacks both the ability to learn and a theoretical foundation, and they usually require unique heuristic rules for a given problem. For this reason, designing unified solutions for combinatorial optimization problems such as the TSP has become a popular research topic in machine learning.

As one of the main machine-learning methods, reinforcement learning (RL) has strong decision making and autonomous learning capabilities. RL is based on the Markov decision process (MDP), which is a sequential decision mathematical model that has natural similarity to the TSP. Therefore, many recent algorithms have adopted RL for TSP, and they can be divided into three categories.

  1. Deep learning algorithm combined with RL. The deep learning (DL) algorithm combined with RL exploits the perception ability of DL and the decision-making ability of RL. Its fast solution speed and strong generalization ability give this combination great potential in finding approximate TSP solutions. For example, Vinyals et al. (2015) [11] used supervised learning for training in a pointer network for TSP. Bello et al. (2016) [12] used an actor–critic policy gradient method in a recurrent neural network. Dai et al. (2017) [13] proposed an S2V-DQN algorithm using a graph embedding network trained by deep Q-learning. Deudon et al. (2018) [14] trained a neural network for TSP by policy gradient using the reinforcement learning rule with a critic. Ma et al. (2019) [15] used hierarchical RL for training in a graph pointer network for TSP. Stohy et al. (2021) [16] used an actor-critic method for training in a hybrid pointer network. These algorithms use RL to train neural networks by testing a large number of small-scale TSP datasets, which consumes a great deal of time and resources, and the accuracy of the solution is not ideal. The essence of this type of algorithm is deep learning, using RL only as a training method.
  2. Heuristic algorithm combined with RL. The heuristic algorithm combined with RL uses the strong generalization ability of RL to improve the heuristic algorithm to obtain better and faster solutions. For example, Liu et al. (2009) [17] proposed an improved GA algorithm with reinforcement mutation, using the genetic algorithm as the framework and reinforcement learning as a mutation operator, making it essentially a heuristic algorithm. Alipour et al. (2018) [18] proposed a hybridization of GAs and a multi-agent reinforcement learning (MARL) heuristic, which uses GA as a tour improvement heuristic and MARL as a construction heuristic. Costa et al. (2020) [19] proposed a 2-opt heuristic algorithm combined with deep RL for TSP, which essentially enhances the learning process of 2-opt. Combining the advantages of Q-learning (QL), Sarsa, and the Monte Carlo algorithm, Zheng et al. (2020) [20] proposed a variable strategy reinforced (VSR) approach, optimized the k-opt process of LKH based on this, and designed VSR-LKH for TSP. Optimizing the parameters of the biased random-key genetic algorithm (BRKGA) by QL, Chaves et al. (2021) [21] proposed a BRKGA-QL algorithm for TSP. These algorithms are still heuristic, using RL only to optimize parameters or strategies.
  3. RL algorithm. The above two categories of algorithms are not RL; they use RL for training or optimization. Algorithms based purely on RL are less common. For example, Ottoni et al. (2018) [22] proposed two RL algorithms for TSP, QL and Sarsa (state–action–reward–state–action), where the RL parameter estimation is based on a response surface model (RSM). That work only optimizes the parameters of RL and does not improve the algorithm framework, so its results are worse than those of the above two types of algorithms. For this reason, we design improved QL-based algorithms in this paper to find a better QL framework, increase the efficiency of this kind of algorithm, and provide a new way to solve the TSP. The principal contributions are as follows:

We introduce an improved QL algorithm for solving the TSP. A new reward matrix is constructed that is able to evaluate the immediate reward of the action more accurately compared to the original reward matrix. To reduce the size of the initial search space, a new initial Q-table is constructed. To balance exploration and exploitation, a new action selection strategy is designed. To more accurately assess the actions of the agent, a new Q-value update formula is designed. A local search strategy is also designed to enhance the algorithm’s local optimization capability.

The remainder of this paper is organized as follows. Section 2 presents the basic theoretical concepts of RL and constructs a QL algorithm for the TSP; after analysis of that algorithm, some improvements are identified. In Section 3.1, based on four proposed improvement strategies, an adaptive beam search QL algorithm is constructed, and a dynamic sub-route-based adaptive beam search QL algorithm is studied in Section 3.2 to further improve the algorithm's performance. Section 4 discusses our experiments. Section 5 presents conclusions and suggested future work.

2. Principle and methodology of QL algorithm to solve TSP

2.1. Theoretical foundation

RL is an important machine-learning method proposed by Minsky in 1956. Its model is based on the MDP, which shows significant advantages and potential in sequential decision problems. Its powerful exploration and autonomous learning abilities [23] are widely used in game-playing [24–26], robotic control [27, 28], and transportation [29–31].

RL is a methodological framework for learning, prediction, and decision-making in which the agent learns the optimal policy by interacting with the environment [32]. Based on the way actions are selected, RL can be categorized as value-based, policy-based, or actor–critic learning. QL and Sarsa are two representative value-based RL algorithms. Value-based algorithms are applicable to discrete action spaces; moreover, QL requires fewer parameters and has a better exploration ability than Sarsa [33], and is thus more likely to find a better solution. We therefore focus on QL algorithms for the TSP.

As a model-free RL algorithm, QL has the agent continually interact with the environment to obtain rewards, so that the action-selection strategy can be evaluated and optimized to maximize the cumulative reward [34–36].

The QL algorithm can be described as an MDP, which is usually represented by a five-tuple <S, A, P, R, γ> [37]. S is the state space of the environment, and s (s∈S) represents a certain state; the environment is the learning object in which the agent performs actions to change its state. A is the action set of the agent, and a (a∈A) represents a certain action; as a learner, the agent tries all possible actions to cognize the environment and explore the proper actions. P is the matrix of state transition probabilities, and p(s,a)∈P is the probability that the agent performs action a in state s, where the selection of the action is determined by policy π. R is the reward matrix, and r(s,a)∈R is the immediate reward obtained by the agent after performing action a in state s. γ is the discount factor that determines the ratio between immediate and cumulative rewards; a higher value indicates a greater emphasis on the latter.

QL uses the Q-table and reward matrix R to evaluate the cumulative and immediate rewards of actions, respectively. The Q-value is updated as follows. Given environment state st at step t, the agent selects action at under the ε-greedy strategy, the state is transferred to st+1, and an immediate reward r(st,at) is obtained. The Q-value is then updated as Formula (1):

(1) q(st,at) ← q(st,at) + α[r(st,at) + γ·max q(st+1,at+1) − q(st,at)], where the max is taken over all actions at+1

where q(st,at) is the cumulative reward of taking action at in state st; max q(st+1,at+1) is the maximum cumulative reward that can be obtained by taking an action at+1 in state st+1; α is a learning factor, α∈[0,1]; and γ is a discount factor, γ∈[0,1].

The ε-greedy strategy guides the agent to choose the action with the largest Q-value with probability ε and to select one of the remaining actions at random with probability 1−ε.
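As a minimal sketch of these two components (the state/action encoding and the default α, γ here are generic placeholders, not the paper's TSP-specific mapping):

```python
import random

def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.1):
    """Formula (1): move q(s,a) toward r + gamma * max_a' q(s',a')."""
    best_next = max((Q[s_next][a2] for a2 in actions_next), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def epsilon_greedy(Q, s, actions, epsilon):
    """The paper's convention: exploit (largest Q-value) with probability
    epsilon; otherwise pick one of the remaining actions at random."""
    best = max(actions, key=lambda a: Q[s][a])
    if random.random() < epsilon or len(actions) == 1:
        return best
    return random.choice([a for a in actions if a != best])
```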

The pseudocode of QL is shown as Algorithm 1.

Algorithm 1. QL

1. Begin

2.     Construct the five-tuple <S, A, P, R, γ>;

3.     Initialize Q-table and reward matrix R;

4.     Define parameters α, γ, ε;

5.     Initialize the maximum number of iterations Imax;

6.     while (current number of iterations < Imax)

7.       Initialize t and st;

8.       while (st is not the terminal state)

9.         Select action at according to ε-greedy policy in state st;

10.         Calculate the immediate reward r(st,at);

11.         Update q(st,at) according to Formula (1);

12.         t = t + 1;

13.       end while

14.     end while

15.     Obtain optimal policy according to Q-table.

16.End

2.2. Problem mapping of QL algorithm for TSP

Solving the TSP using the QL algorithm requires the construction of five-tuples, with the following parameters.

n: total number of cities;

t: number of visited cities, 1 ≤ t ≤ n;

St: set of visited cities after visiting t cities;

st: number of the t-th visited city, st∈St;

At: set of unvisited cities after visiting t cities;

at: number of the next city to be visited after city st, at∈At;

i: number of the current city, 1 ≤ i ≤ n, used as the row index of the matrices P, R, Q, D;

j: number of the next city to be visited from the current city, 1 ≤ j ≤ n, used as the column index of the matrices P, R, Q, D;

Ic: current iteration number;

Imax: maximum iteration number;

P: action selection probability matrix, where p(i,j) is the probability of visiting city j from city i;

R: reward matrix, also known as the immediate reward matrix, where r(i,j) is the immediate reward obtained by visiting city j from city i;

Q: Q-table, also known as the cumulative reward matrix, where q(i,j) is the cumulative reward obtained by visiting city j from city i;

D: distance matrix, where d(i,j) is the distance from city i to city j.

2.3. Construction of proposed QL algorithm

The process of the algorithm for solving the TSP can be described as follows. Starting from city st (the initial value of t is 1), select city at from At according to the ε-greedy strategy and visit it, then determine the immediate and cumulative rewards from st to at. Repeat this process until all cities are visited.

  1. (1) Construct reward matrix R, where each element is calculated as
(2)
  2. (2) Construct the initial Q-table, where each element is initialized to zero:
(3) q(i,j) = 0
  3. (3) Determine the ε-greedy strategy as
(4) p(st,at) = ε, if at is the unvisited city with the largest Q-value; p(st,at) = (1−ε)/(n−t−1), otherwise

Formula (4) shows that we start from city st and visit city at with a certain probability. The city with the largest Q-value has a probability of ε to be visited, while the other (n−t−1) cities have a probability of (1−ε) /(n−t−1) to be visited.

The pseudocode of the QL algorithm for the TSP is shown as Algorithm 2.

Algorithm 2. QL algorithm for TSP

1. Input: TSP instance with n cities

2. Output: optimal solution

3. Begin

4.     Initialize the Q according to Formula (3);

5.     ε = 0.95;

6.     Ic = 0;

7.     while (Ic < Imax)

8.       Randomly select a city to be the initial one and denote it as s1;

9.       S1 = {s1};

10.       A1 = {a1, a2, …, an-1};

11.       t = 1;

12.       while (t ≤ n)

13.         Determine the next city at to be visited from st according to Formula (4);

14.         Calculate the immediate reward r(st,at) from city st to at according to Formula (2);

15.         Update the cumulative reward q(st,at) from city st to at according to Formula (1);

16.         At+1 = At-{at};

17.         st+1 = at;

18.         St+1 = St∪{st+1};

19.         t = t + 1;

20.         end while

21.         Calculate and save the length of the visiting sequence of the n cities in set Sn;

22.         Ic = Ic + 1;

23.       end while

24.       The route corresponding to Sn with the shortest route length will be the optimal one.

25.End
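Algorithm 2 can be condensed into a short runnable sketch. The immediate reward r(i,j) = 1/d(i,j) below is an assumed stand-in for Formula (2), which the source renders only as an image; everything else follows the pseudocode above.

```python
import math
import random

def ql_tsp(coords, iters=300, alpha=0.1, gamma=0.1, eps=0.95, seed=0):
    """Sketch of Algorithm 2: tabular Q-learning over city-to-city moves.
    r(i,j) = 1/d(i,j) is an assumption standing in for Formula (2)."""
    rng = random.Random(seed)
    n = len(coords)
    d = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    Q = [[0.0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(iters):
        s = rng.randrange(n)                      # random starting city s1
        tour, unvisited = [s], set(range(n)) - {s}
        while unvisited:
            i = tour[-1]
            cand = sorted(unvisited)
            greedy = max(cand, key=lambda j: Q[i][j])
            if rng.random() < eps or len(cand) == 1:
                a = greedy                        # exploit with probability eps
            else:
                a = rng.choice([j for j in cand if j != greedy])
            r = 1.0 / d[i][a]                     # assumed Formula (2)
            nxt = max((Q[a][j] for j in unvisited - {a}), default=0.0)
            Q[i][a] += alpha * (r + gamma * nxt - Q[i][a])   # Formula (1)
            tour.append(a)
            unvisited.discard(a)
        length = sum(d[tour[k]][tour[(k + 1) % n]] for k in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len
```

On a tiny instance such as the four corners of a unit square, this sketch quickly recovers the perimeter tour.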

2.4. Analysis of results

We selected 18 instances of 30–1655 cities from TSPLIB to test the proposed QL algorithm. The parameters α, γ, and ε of the algorithm were 0.1, 0.1, and 0.95, respectively. The results are compared with those of the VDWOA and DWOA algorithms [9] in the same environment. As only 12 instances are listed in [9], we ran VDWOA and DWOA ourselves on the other six instances; the details are listed in Table 2.

The results are as follows. For instances with fewer than 500 cities, the QL algorithm is 2–40 times faster than the comparative algorithms; for instances with 500–1000 cities, it is 4–60 times faster, and for instances with more than 1000 cities, it is 5–40 times faster. The QL algorithm has distinct advantages in solution time, but the accuracy is relatively poor.

The proposed QL algorithm (Algorithm 2) is only a simple simulation of the traditional QL algorithm (Algorithm 1), which does not consider the characteristics of TSP and the defects of the QL algorithm itself. There are four reasons for its poor accuracy:

  1. Formula (2) for calculating the elements of reward matrix R considers only the distance between the current city i and the next city j to be visited, ignoring the distances between city j and subsequent cities. This is too greedy and leads to a high probability of selecting a locally optimal city.
  2. The Q-table is simply initialized with 0, giving all cities the same initial Q-value, so that the algorithm searches nearly blindly in the early stage, resulting in a large search range and decreasing the convergence speed. Due to the great randomness of city selection in the early stage, the probability of worse cities being selected may be increased, causing the algorithm to easily fall into local optima.
  3. Cities are selected based on an ε-greedy strategy. If ε is too large, the probability of selecting the globally optimal city is reduced and the search scope becomes smaller, causing premature convergence. If ε is too small, the probability of selecting a globally worse city increases and the search scope becomes larger, making convergence difficult.
  4. There is no new city to visit after the last city, and all that remains is to return to the starting city. Thus, there is no need to consider the cumulative reward, and Formula (1) is no longer suitable for updating the Q-value.

3. QL-based algorithms for TSP

We studied several methods to improve the QL algorithm based on the above analysis.

3.1. Self-adaptive beam search QL algorithm

We propose a self-adaptive beam search QL algorithm, ABSQL. Four strategies are introduced: the weighting function-based reward matrix, power function-based initial Q-table, self-adaptive ε-beam search strategy, and a new formula for updating the Q-value.

3.1.1. Weighting function-based reward matrix.

For reason (1) above, we draw on the idea of the finite discount cumulative reward strategy [38]. Based on Formula (2), we introduce dmin and davg to consider the effect of the distance from city j to subsequent cities from the local and global dimensions, respectively, and compute the reward as (5) where θ1 and θ2 control the degree of influence of city i on city j and its successors, respectively, θ1∈(0.5,1), θ2∈(0.5,1); and dmin and davg are the minimum and average distances, respectively, from city j to the other cities to be visited.

The above strategy (denoted as stgy_1) synthetically considers the minimum and average distance between the current city to be visited and subsequent cities. stgy_1 can mitigate the greediness of the original reward matrix and increase the accuracy of rewards to find a higher-quality solution.
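Formula (5) itself appears only as an image in the source, so the sketch below is an illustrative stand-in: it blends the direct distance d(i,j) with dmin and davg using the weights θ1 and θ2, which captures the intent of stgy_1 but may differ from the paper's exact functional form.

```python
def reward(i, j, d, unvisited, theta1=0.8, theta2=0.8):
    """Illustrative stand-in for Formula (5): discount the direct distance
    d(i,j) by the minimum (local) and average (global) distance from j to
    the cities that remain to be visited. Assumed form, not the paper's."""
    rest = [k for k in unvisited if k != j]
    if not rest:                      # j is the last city to be visited
        return 1.0 / d[i][j]
    d_min = min(d[j][k] for k in rest)
    d_avg = sum(d[j][k] for k in rest) / len(rest)
    return 1.0 / (d[i][j] + theta1 * d_min + theta2 * d_avg)
```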

3.1.2. Power function-based initial Q-table.

For reason (2) above, since the distances between cities affect city selection, the reward matrix constructed according to Formula (5) positively affects the selection of cities. Therefore, we introduce a power function so that the initial Q-table is positively correlated with the improved reward matrix. The improved initial Q-table is constructed as (6) where δ is a parameter controlling the correlation between q(i,j) and r(i,j), δ∈(1.0,1.2).

The improved initial Q-table (denoted as stgy_2) can reduce the search scope and enhance the search ability in the early stage, enabling faster convergence and a better-quality solution.
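One plausible reading of Formula (6), assuming the power function is applied elementwise to the improved reward matrix (the exact form is rendered as an image in the source):

```python
def init_q_table(R, delta=1.12):
    """Assumed reading of Formula (6): initialize each q(i,j) as a power of
    the improved reward r(i,j), making the initial Q-table positively
    correlated with the reward matrix (delta in (1.0, 1.2))."""
    return [[r ** delta for r in row] for row in R]
```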

3.1.3. Self-adaptive ε-beam search strategy.

For reason (3) above, considering that the beam search algorithm has the characteristics of fast operation and a tendency to avoid local optima, a self-adaptive ε-beam search strategy is designed. By improving the fixed ε to be self-adaptive and replacing the greedy search with a beam search, the probability of action selection can be properly controlled.

  1. (1) Self-adaptive ε strategy

In the early stage of the algorithm, the search behavior should be relatively random and the search scope should be large. In this stage, the algorithm focuses on exploration to avoid falling into local optima. In the later stage, the search scope should be small, and the algorithm turns to exploitation, while the convergence speed is gradually accelerated. Exploration and exploitation can be controlled by adjusting the value of ε. With a larger value, the algorithm tends more toward exploitation. Therefore, the value of ε should increase with iteration. The improved calculation is (7) where b and c are parameters to control the respective initial and final values of ε.
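Formula (7) is rendered as an image in the source; as a purely illustrative stand-in, the sketch below ramps ε linearly from b toward c (clipped to 1.0) over the iterations, matching the stated intent that ε should increase with iteration. The paper's exact schedule may differ.

```python
def adaptive_epsilon(ic, i_max, b=0.25, c=1.05):
    """Illustrative stand-in for Formula (7): epsilon grows with the
    iteration count ic from roughly b toward c (clipped to 1.0), shifting
    the search from exploration toward exploitation."""
    return min(b + (c - b) * ic / i_max, 1.0)
```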

  1. (2) Beam search strategy

Beam search is a heuristic graph search algorithm [39–41] that gradually prunes poor-quality nodes and retains high-quality ones. It is usually adopted when the solution space of the graph is relatively large, to decrease the space and time cost of the search.

Based on a beam search, we sort Q-values of unvisited cities in descending order, and then select the top several as the next candidates to be visited. Considering the time cost and complexity of the algorithm, the number of candidate cities is set to 2, i.e., the current optimal or the second optimal city is selected as the next visited city. Compared with the greedy strategy, which only keeps the current optimal city, it has a larger search scope, which can increase the possibility of selecting the global optimal city, and makes the algorithm less likely to fall into local optima.

In summary, the self-adaptive ε-beam search strategy (denoted as stgy_3) can be described as (8) where formula (a) indicates that, starting from city st, the city with the maximum Q-value is selected with probability ε′; formula (b) indicates that the city with the second-largest Q-value is selected with probability (1−ε′)·ε′; and formula (c) indicates that each of the other cities in At is selected with probability (1−ε′)²/(n−t−2).

The improved strategy can balance exploration and exploitation, expand the search scope in the early stage, and enhance the search ability in the middle and later stages.
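The selection probabilities described above can be sketched directly (`q_row` holds the Q-values of the current city's row; `candidates` is the set of unvisited cities):

```python
import random

def beam_select(q_row, candidates, eps, rng=random):
    """Self-adaptive epsilon-beam search (Formula (8)): keep the two best
    candidates by Q-value; pick the best with probability eps, the second
    best with probability (1 - eps) * eps, and otherwise one of the
    remaining candidates uniformly at random."""
    ranked = sorted(candidates, key=lambda j: q_row[j], reverse=True)
    u = rng.random()
    if u < eps or len(ranked) == 1:
        return ranked[0]                      # case (a)
    if u < eps + (1 - eps) * eps or len(ranked) == 2:
        return ranked[1]                      # case (b)
    return rng.choice(ranked[2:])             # case (c)
```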

3.1.4. Improvement of Q-value update formula.

For reason (4) above, the Q-value of the last visited city does not contain long-term rewards, so the value calculated by using Formula (1) that contains long-term rewards is inevitably higher than its real Q-value. Furthermore, the influence of such overestimation will gradually accumulate with iteration, making the estimation of the Q-table more inaccurate, affecting the judgment of city selection. Therefore, Formula (1) is no longer suitable for updating the Q-value of the last visited city.

The last visited city has no further city to visit and needs only to return to the starting city; hence, the long-term reward in Formula (1) should not be considered there. The improved Q-value update formula is

(9a) q(st,at) ← q(st,at) + α[r(st,at) + γ·max q(st+1,at+1) − q(st,at)], for t < n
(9b) q(st,at) ← q(st,at) + α[r(st,at) − q(st,at)], for t = n

where formula (b) is used for the last visited city, and formula (a) for all other cities.

The improved Q-value update formula (denoted as stgy_4) can make the values of the Q-table more accurate when evaluating city selection.
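The two-case update can be sketched as follows (`remaining` is the set of cities still unvisited after taking action a; an empty set marks the last visited city):

```python
def update_q(Q, s, a, r, s_next, remaining, alpha=0.1, gamma=0.1):
    """Formula (9): an ordinary step (9a) includes the discounted best
    future value; for the last visited city (9b, `remaining` empty) the
    long-term term is dropped, so q(s,a) is pulled toward the immediate
    reward only."""
    if remaining:                                     # Formula (9a)
        target = r + gamma * max(Q[s_next][j] for j in remaining)
    else:                                             # Formula (9b)
        target = r
    Q[s][a] += alpha * (target - Q[s][a])
```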

The pseudocode of the ABSQL algorithm is presented as Algorithm 3.

Algorithm 3. ABSQL algorithm for TSP

1. Input: TSP instance with n cities

2. Output: optimal solution

3. Begin

4.     Initialize the Q table according to Formula (6);

5.     Ic = 0;

6.     while (Ic < Imax)

7.       Calculate ε´ according to Formula (7);

8.       Select one city randomly to be the starting one and denoted it as s1;

9.       Let S1 = {s1}, A1 = {a1, a2, …, an−1}, and the number of visited cities t = 1;

10.       while (t ≤ n)

11.         Determine the next city at to be visited from st according to Formula (8);

12.         Calculate the immediate reward r(st,at) according to Formula (5);

13.         if (t != n)

14.           Update the cumulative reward q(st,at) from city st to at according to Formula (9a);

15.         else

16.           Update the cumulative reward q(st,at) from city st to at according to Formula (9b);

17.         end if

18.         At+1 = At-{at};

19.         st+1 = at;

20.         St+1 = St∪{st+1};

21.         t = t + 1;

22.       end while

23.       Calculate and save the length of the visiting sequence of the n cities in set Sn;

24.       Ic = Ic + 1;

25.     end while

26.     The route corresponding to Sn with the shortest route length will be the optimal one.

27.End

3.1.5. Analysis of results.

We used the 18 instances in Section 2.4 to test the effectiveness of the proposed ABSQL algorithm through analysis of the length and graph of the route. The parameters α and γ used in the ABSQL algorithm are the same as in the QL algorithm. The newly added parameters θ1, θ2, δ, b and c are 0.8, 0.8, 1.12, 0.25, and 1.05, respectively.

  1. (1) Route length analysis

We compare solutions obtained by the ABSQL algorithm with those of QL and VDWOA, with results as shown in Table 2. Denote min, avg, and max as the minimum, average, and maximum values, respectively, with deviation rates

(10) Smin = (min − opt)/opt × 100%
(11) Savg = (avg − opt)/opt × 100%
(12) Smax = (max − opt)/opt × 100%

where opt is the known optimal solution.
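Reading Formulas (10)–(12) as the usual percentage gaps from the known optimum (an interpretation consistent with how Smin and Savg are used below), the deviation rates can be computed as:

```python
def deviation_rates(lengths, opt):
    """Deviation rates as percentage gaps from the known optimum opt:
    Smin, Savg, and Smax for the minimum, average, and maximum tour
    lengths found over repeated runs."""
    gap = lambda v: (v - opt) / opt * 100.0
    avg = sum(lengths) / len(lengths)
    return gap(min(lengths)), gap(avg), gap(max(lengths))
```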

The comparison of ABSQL and QL shows that ABSQL has a faster solution, the quality of which is greatly improved.

For the 18 instances, the min of ABSQL is better than that of QL, and its Smin is 0–92% less than that of QL. Comparing ABSQL and VDWOA in terms of time, ABSQL is faster than VDWOA on all instances. In terms of solution quality, the following results are obtained.

For 14 small-scale instances with the total number of cities less than 500, the min of ABSQL is slightly worse than that of VDWOA for eight instances, and its Smin is 0.08%–5.7% greater. However, the avg of ABSQL has clear advantages over VDWOA in 10 instances, and its Savg is 0.31%–7.25% less. For four medium- and large-scale instances with more than 500 cities, the min of ABSQL is better than that of VDWOA, and its Smin is 0.67%–5.11% less. Since VDWOA takes too long for medium- and large-scale instances, it cannot obtain enough solutions in a limited time to calculate avg. Therefore, the avg and Savg of the two algorithms are not compared.

The above analysis shows that ABSQL has a better ability than VDWOA for medium- and large-scale instances. However, as comparing opt values reveals, ABSQL still has some deficiencies. For small-scale instances, the Smin of ABSQL is 0–10.95%, and for medium- and large-scale instances, the Smin of ABSQL is 12.34%–14.31%, i.e., the Smin of ABSQL increases with the scale of the problem, so the solution accuracy of ABSQL requires further improvement.

  1. (2) Route diagram analysis

Analyzing the route diagrams obtained from the above instances, a common problem is found: one or more sub-routes are not completely optimized. We select instance bays29 to analyze this problem from two aspects. Fig 1 is an incompletely optimized route diagram, and Fig 2 is an optimal route diagram. The X- and Y-axes represent the abscissa and ordinate, respectively, of each city; circled numbers indicate the serial numbers of cities; solid lines are routes between cities. The three incompletely optimized sub-routes in Fig 1 and their corresponding fully optimized sub-routes in Fig 2 are marked with ovals.

  1) Cross or overlap between edges

In Fig 1, sub-route 11-17-22-14 is incompletely optimized, with a crossover between edges (11,17) and (22,14). In Fig 2, there is no crossover of edges, sub-route 11-22-17-14 is fully optimized, and sub-route 11-17-22-14 can be optimized to 11-22-17-14 by swapping nodes 17 and 22.

  2) No cross or overlap between edges

In Fig 1, there is no crossover or overlap between edges of sub-route 23-8-24-27-16. However, the sub-route is not completely optimized. This is mainly because the sum of the lengths of edges (23,27) and (27,8) is less than that of (16,27) and (27,24). Sub-route 23-27-8-24-16 of Fig 2 is optimal. Sub-route 23-8-24-27-16 can be optimized to 23-27-8-24-16 by swapping nodes 8 and 24, and then nodes 24 and 27. Similarly, sub-route 21-3-29-26 of Fig 1 can be improved to the better sub-route 21-29-3-26 of Fig 2 by swapping nodes 3 and 29.

Through the above analysis, it can be seen that the incomplete optimization of sub-routes is the main reason for the relatively poor solution of the ABSQL algorithm compared with the known optimum. Considering that the route length can be shortened by swapping some nodes of the sub-route, we introduce the 2-opt strategy to further optimize the sub-route.

3.2. Dynamic sub-route-based ABSQL algorithm

To solve the problem of incomplete optimization of sub-routes, static and dynamic sub-route optimization strategies are designed to optimize routes output by ABSQL.

3.2.1. Static sub-route optimization.

In this section, we perform the following operations. Denote the route output by ABSQL as Sn. A partition operation is first performed on Sn as follows. Starting from the first node, successively divide Sn (which has n nodes) into ⌊n/m⌋ sub-routes of m nodes each, denoted sr1, sr2, …, sr⌊n/m⌋. If a sub-route with fewer than m nodes remains, it is denoted rp, and the set of all sub-routes is Sr = {sr1, sr2, …, sr⌊n/m⌋, rp}. Because the number of nodes in each sub-route is fixed, this process is called static sub-route optimization. For sub-routes in Sr with four or more nodes, all nodes except the first and last are selected for 2-opt optimization, and the optimized set is denoted Sr1. The optimization process is as follows.

Step 1: Taking the second node in the sub-route as the starting node, perform (m−2)*(m−3)/2 instances of 2-opt optimization with the other interior nodes;

Step 2: If the length of the sub-route does not decrease after 2-opt optimization, the optimization of the sub-route is stopped; otherwise, the new sub-route is reserved and we return to step 1.
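The partition step and the interior 2-opt loop can be sketched as follows (a straightforward reading of the procedure above; the first and last nodes of each sub-route are left fixed, as in the static strategy):

```python
def partition(route, m):
    """Split a route into consecutive sub-routes of m nodes each; a
    shorter remainder sub-route rp may be left at the end."""
    return [route[i:i + m] for i in range(0, len(route), m)]

def two_opt_inner(route, d):
    """Repeatedly apply 2-opt moves to the interior nodes of `route`
    (endpoints stay fixed) until no move shortens it."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(route) - 2):
            for k in range(i + 1, len(route) - 1):
                # cost of the two edges removed vs. the two edges added
                before = d[route[i - 1]][route[i]] + d[route[k]][route[k + 1]]
                after = d[route[i - 1]][route[k]] + d[route[i]][route[k + 1]]
                if after < before - 1e-12:
                    route[i:k + 1] = reversed(route[i:k + 1])
                    improved = True
    return route
```

For example, on four cities placed on a line, the crossing sub-route 0-2-1-3 is repaired to 0-1-2-3, mirroring the bays29 node-swap discussion above.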

3.2.2. Dynamic sub-route optimization.

Since static sub-route optimization does not consider optimization of the first and last nodes of a sub-route, and the number of nodes in each sub-route is fixed, a dynamic optimization strategy is introduced in which the first and last nodes are also considered. Dynamic optimization is performed on the set Sr1 obtained by static optimization, so the number of nodes in each sub-route can change over the optimization process, which proceeds as follows.

Step 1: Suppose the number of the current optimization round is h, with initial value 1 and h ≤ L. The set of sub-routes after the h-th round of optimization is Srh = {sr1h, sr2h, …, srf(h)−1h, rph}, where sr1h to rph represent the first to f(h)-th sub-routes, respectively; f(h) is calculated according to Formula (13), and L, the total number of optimization rounds, is determined by Formula (14).

(13)(14)

Step 2: All sub-routes in the set Srh except rph are merged in pairs in order; if srf(h)−1h is left over, it is merged with rph;

Step 3: For each merged sub-route obtained in Step 2, select the nodes between the 1/4 and 3/4 positions of its node sequence and perform 2-opt optimization on them to obtain the set Srh+1;

Step 4: If h≤L, then let h = h+1, and return to step 2; otherwise, merge all sub-routes in SrL+1 to obtain the new path.
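One round of the dynamic strategy (Steps 2–3) can be sketched as below. The helpers `merge_pairs` and `middle_segment` are illustrative names; as a simplification of Step 2, this sketch folds any leftover sub-route into the last merged pair rather than tracking rp separately.

```python
def merge_pairs(subroutes):
    """Merge sub-routes pairwise in order; a leftover sub-route joins the last merge."""
    merged = []
    i = 0
    while i + 1 < len(subroutes):
        merged.append(subroutes[i] + subroutes[i + 1])
        i += 2
    if i < len(subroutes):          # odd one out: attach it to the final merged route
        if merged:
            merged[-1] = merged[-1] + subroutes[i]
        else:
            merged.append(subroutes[i])
    return merged

def middle_segment(sub):
    """Index bounds of the 1/4-to-3/4 portion of a merged sub-route (Step 3),
    i.e. the slice on which 2-opt is applied in this round."""
    n = len(sub)
    return n // 4, 3 * n // 4
```

Each round halves the number of sub-routes while doubling their length, so after L rounds the merged pieces cover the whole route, which is why the first and last nodes of the original static sub-routes also get optimized.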

By combining ABSQL with the dynamic sub-route optimization strategy (denoted as stgy_5), a dynamic sub-route-based self-adaptive beam search Q-learning algorithm (DSRABSQL) for TSP is designed.

The pseudocode of the above algorithm is shown as Algorithm 4.

Algorithm 4. DSRABSQL algorithm for TSP

1. Input: Sn of ABSQL algorithm

2. Output: optimal solution

3. Begin

4.     Initialize m;

5.     Obtain Sr from Sn according to the partition operation in Section 3.2.1;

6.     Optimize Sr according to Steps 1 and 2 in Section 3.2.1 to obtain Sr1;

7.     Initialize L according to Formula (14);

8.     Initialize h = 1;

9.     while (h ≤ L)

10.         Execute Steps 2 and 3 of Section 3.2.2;

11.         h = h + 1;

12.     end while

13.     Merge all sub-routes in SrL+1 in order to obtain the optimal route.

14.End

3.2.3. Analysis of results.

We used the 18 instances in Section 2.4 to test the proposed DSRABSQL algorithm. The parameters α, γ, θ1, θ2, δ, b, and c used in DSRABSQL are the same as in ABSQL, with the additional parameter m set to 10. The performance of DSRABSQL is analyzed by comparing its optimal route diagrams with those of ABSQL.

Since most instances have too many cities or an uneven city distribution, their route diagrams are hard to read, so we present the route diagrams of berlin52 and tsp225 for illustration.

Figs 3 and 4 are the optimal routes of berlin52 and tsp225, respectively, as obtained by ABSQL. Figs 5 and 6 are the corresponding routes as obtained by DSRABSQL.

Fig 3. The optimal route of Berlin52 output by ABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g003

Fig 4. The optimal route of Tsp225 output by ABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g004

Fig 5. The optimal route of Berlin52 output by DSRABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g005

Fig 6. The optimal route of Tsp225 output by DSRABSQL algorithm.

https://doi.org/10.1371/journal.pone.0283207.g006

From the analysis of the route diagram, it can be found that the sub-routes marked in Figs 3 and 4 are not completely optimized, and the sub-routes in the corresponding positions of Figs 5 and 6 are completely optimized. Therefore, by enhancing the local optimization ability of ABSQL, DSRABSQL greatly improves the solution accuracy of TSP.

4. Experiments and analysis of results

The simulation is carried out in Spyder on an Intel Core i5-7500 CPU at 3.4 GHz with 8 GB of memory, running Windows 10 (64-bit). We test 18 TSP instances with city scales ranging from 30 to 1655. The instances used in this experiment are taken from the TSPLIB library (http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/). Table 1 lists the simulation parameters of the three algorithms and elaborates on the characteristics of the parameters adopted by each algorithm.

Table 1. Simulation parameters for the QL, ABSQL and DSRABSQL algorithms.

https://doi.org/10.1371/journal.pone.0283207.t001

The values of the parameters listed in Table 1 were obtained through experimental tests, and the specific procedures are as follows.

Procedure 1: Determine α, γ and ε of the QL algorithm.

  1. Set ε based on QL-related books and literature;
  2. Set the initial value, final value and step size of parameters α and γ, and select the optimal values of these parameters according to the test results.

Procedure 2: Determine θ1, θ2, δ, b and c of the ABSQL algorithm.

  1. Set the initial values, final values and step sizes of parameters θ1, θ2 in stgy_1, parameter δ in stgy_2, and parameters b and c in stgy_3 in succession, and select the optimal values of these parameters according to the test results;
  2. The parameters α and γ in stgy_4 inherit the optimal values of α and γ from the QL algorithm.

Procedure 3: Determine parameter m of the DSRABSQL algorithm.

Set the initial value, final value and step size of parameter m in stgy_5, and select the optimal value of these parameters based on the test results.

The QL, ABSQL, and DSRABSQL algorithms are each iterated Imax times, and the results are taken over nt simulation runs for each instance. Imax and nt are calculated according to Formulas (15) and (16).

Table 2 lists min (Smin), avg (Savg), max (Smax), Std, and the average time cost of the 18 instances tested by QL, ABSQL, and DSRABSQL, and the corresponding values of VDWOA and DWOA.

Table 2. Results of proposed algorithms and algorithms in [9] for 18 instances.

https://doi.org/10.1371/journal.pone.0283207.t002

From Table 2, it can be seen that min, avg, max, and Std of DSRABSQL are mostly better than those of VDWOA for 18 instances, which indicates that the solution quality and stability of DSRABSQL are superior to those of VDWOA.

To further analyze the performance of the algorithm, the data of Table 2 are discussed in two aspects: the first is the data comparison of QL, ABSQL and DSRABSQL, and the second is the data comparison between DSRABSQL and VDWOA.

For the first aspect, a small-scale (ch150), a medium-scale (rat575), and a large-scale (fl1400) instance are selected to analyze the performance of DSRABSQL. Figs 7–9 illustrate min, avg, max, Std, and Nmin (the number of mins) obtained by QL, ABSQL, and DSRABSQL on ch150, rat575, and fl1400 under different numbers of iterations. Since the time cost of an algorithm increases with the scale of the instance, ch150, rat575, and fl1400 are tested 50, 20, and 10 times, respectively.

Fig 7. min, avg, max and Std of ch150 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g007

Fig 8. min, avg, max and Std for rat575 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g008

Fig 9. min, avg, max and Std of fl1400 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g009

Fig 10. The value of Nmin for ch150, rat575 and fl1400 obtained by three algorithms.

https://doi.org/10.1371/journal.pone.0283207.g010

A chi-square test is performed on the Nmin values obtained for ch150 under 400 and 500 iterations of DSRABSQL, as shown in Fig 10, where Nmin is the number of mins and It is the number of iterations. With 1 degree of freedom, we assume Hypothesis 0: Nmin and It are correlated.

For this purpose, the sample frequency contingency table of Nmin and It, shown in Table 3, is constructed, and a chi-square test is run on the fourfold data in Table 3, yielding the test statistic χ2 = 6.895.
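For reference, the fourfold (2×2) chi-square statistic has a simple closed form. The sketch below uses hypothetical counts, since Table 3's data are not reproduced here; the paper reports χ2 = 6.895 for its actual table.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]:
        chi2 = n * (a*d - b*c)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical fourfold counts, purely illustrative:
stat = chi_square_2x2(10, 20, 30, 40)
```

With 1 degree of freedom, the statistic is then compared against the chi-square critical values (e.g., 6.635 at p = 0.01 and 7.879 at p = 0.005), as done in the text below.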

Table 3. Sample frequency contingency table of Nmin and It.

https://doi.org/10.1371/journal.pone.0283207.t003

From Table 2 and Figs 7–10, it can be seen that the min, avg, max, Std, and Nmin of DSRABSQL are better than those of QL and ABSQL, which indicates that the solution quality, stability, and convergence of DSRABSQL are all significantly improved relative to QL and ABSQL throughout the iterative process. Table 3 shows that χ2 ∈ (6.635, 7.879), so the probability p that Hypothesis 0 is true is between 0.99 and 0.995, which indicates that Nmin and It are significantly correlated. Similarly, Nmin and It can be inferred to be correlated for instances rat575 and fl1400. Therefore, the Nmin of DSRABSQL increases with It until it stabilizes; i.e., the probability of DSRABSQL obtaining its minimum value min increases with It until stability. The above analysis shows that DSRABSQL has learning ability that is continuously enhanced over the iterations until stability.

For the second aspect, the Mann-Whitney-Wilcoxon test is used to verify the performance of DSRABSQL relative to VDWOA based on the data in Table 2. A composite parameter Sc is introduced to evaluate the performance of the algorithms on instances of different scales: Sc = 0.5·Smin + 0.4·Savg + 0.1·Smax. (17) Since Smin, Savg, and Smax affect the performance of the algorithm differently, they are assigned the weights 0.5, 0.4, and 0.1, respectively. The smaller Sc is, the better the comprehensive performance of the algorithm.
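Assuming Sc is the plain weighted sum of the three statistics described above (a reading of Formula (17) based on the stated weights, not a reproduction of the authors' code), the metric can be computed as:

```python
def composite_score(s_min, s_avg, s_max, weights=(0.5, 0.4, 0.1)):
    """Weighted composite score Sc; smaller values indicate better
    comprehensive performance. Default weights follow the text:
    0.5 for Smin, 0.4 for Savg, 0.1 for Smax."""
    w_min, w_avg, w_max = weights
    return w_min * s_min + w_avg * s_avg + w_max * s_max
```

For example, an algorithm with Smin = 100, Savg = 110, Smax = 120 scores Sc = 106, so a rival whose Sc is lower is judged better overall even if one of the three statistics is worse.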

The values of Sc for the 18 instances are calculated for DSRABSQL and VDWOA, with results shown in Tables 4 and 5. To verify that DSRABSQL shows more advantages in solving medium- and large-scale instances, the 18 instances are divided into two groups, G1 and G2: G1 includes 14 small-scale instances, and G2 includes 4 medium- and large-scale instances. In group G1, the Sc values of DSRABSQL are marked as Sample 1 and those of VDWOA as Sample 2; in group G2, the Sc values of DSRABSQL are marked as Sample 3 and those of VDWOA as Sample 4. Then, the following hypotheses are made:

Hypothesis 1(a): Sc of DSRABSQL in G1 and G2 is not significantly different from that of VDWOA;

Hypothesis 1(b): Sc of DSRABSQL in G1 and G2 differ significantly from that of VDWOA.

Table 4. Sc and rank of DSRABSQL and VDWOA algorithms in G1.

https://doi.org/10.1371/journal.pone.0283207.t004

Then, the Sc values of Samples 1 and 2 are sorted in ascending order. R1 and R2 are the respective rank sums of Samples 1 and 2. The Mann-Whitney-Wilcoxon test is performed on R1 and R2 to obtain the statistics U1 and U2; the results are shown in Table 4. Similarly, R3 and R4 are the rank sums of Samples 3 and 4, and U3 and U4 are obtained by the same process; the results are shown in Table 5.
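The rank-sum-to-U conversion behind Tables 4 and 5 follows the standard formula U_i = R_i − n_i(n_i + 1)/2. A self-contained sketch (the sample values in the test are hypothetical, not the paper's Sc data):

```python
def mann_whitney_u(x, y):
    """Return (U1, U2) for samples x and y, using average ranks for ties."""
    combined = sorted((v, src) for src, sample in ((0, x), (1, y)) for v in sample)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1                      # tie group spans 1-based ranks i+1 .. j
        avg_rank = (i + 1 + j) / 2      # average rank assigned to the whole group
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(ranks[k] for k, (_, src) in enumerate(combined) if src == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2         # U1 = R1 - n1(n1+1)/2
    return u1, n1 * n2 - u1             # U1 + U2 = n1 * n2
```

The smaller of U1 and U2 is then compared against the tabulated Mann-Whitney critical value for the chosen significance level, as done in the text below.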

Table 5. Sc and rank of DSRABSQL and VDWOA algorithm in G2.

https://doi.org/10.1371/journal.pone.0283207.t005

For U1 and U3 in Tables 4 and 5, checking the Mann-Whitney table gives U1 = 27 < 47 = U(α = 0.02) < U(α = 0.025), so U1 remains below the critical value after Bonferroni correction, and U3 = 0 = U(α = 0.05); thus the probability p that Hypothesis 1(b) is true is 0.95. This shows that, compared with VDWOA, the DSRABSQL algorithm has significant advantages on instances of all scales.

Fig 11 also shows Nmin obtained by testing instances kroa100, ch150, d198, and fl417 50 times each with DSRABSQL and VDWOA under the condition of convergence. The results show that the min of DSRABSQL is better than that of VDWOA and that, once the algorithms converge, DSRABSQL has a much higher probability of obtaining min.

Fig 11. The Nmin of DSRABSQL and VDWOA algorithms obtained by testing kroa100, ch150, d198, fl417 50 times, respectively.

https://doi.org/10.1371/journal.pone.0283207.g011

To further verify the significance of the difference between DSRABSQL and VDWOA on groups G1 and G2, Vargha and Delaney's Â statistic was used to compute the effect size, with the following results: Â21 = 0.86 > 0.71 and Â43 = 1 > 0.71. This means that the probability that DSRABSQL outperforms VDWOA on instances of all scales is about 0.86–1. Combined with the conclusions obtained from Fig 10 and Table 2, it can further be concluded that DSRABSQL has better learning ability than the population- and random-search-based VDWOA.
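Vargha and Delaney's Â is the probability that a value drawn from one sample exceeds a value drawn from the other, with ties counted half. A minimal sketch (the samples in the test are hypothetical):

```python
def vargha_delaney_a(x, y):
    """A-hat effect size: P(value from x > value from y), ties counted 0.5.
    0.5 means no effect; values above ~0.71 are conventionally 'large'."""
    greater = sum(1 for a in x for b in y if a > b)
    equal = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * equal) / (len(x) * len(y))
```

Note that since smaller Sc is better, an Â near 1 for VDWOA's Sc over DSRABSQL's Sc (as with Â21 and Â43 here) indicates a large advantage for DSRABSQL.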

The above analysis shows that the performance of DSRABSQL is significantly improved compared with QL, ABSQL, and VDWOA, because DSRABSQL undergoes two stages of improvement based on the QL algorithm. The first stage, from QL to ABSQL, introduces the strategies stgy_1–stgy_4, which are all improvement methods within the QL framework. The second stage, from ABSQL to DSRABSQL, introduces the stgy_5 strategy, an improvement method outside the QL framework. We compare the improvement effects of these strategies in what follows.

  1. Two groups of comparative experiments are designed. Group 1 includes QL and five comparison algorithms, obtained by combining one of the five strategies above with QL and denoted QL+1, …, QL+4, QL+5. QL+1–QL+4 add an improvement within the QL framework, and QL+5 adds the improvement outside the QL framework. Group 2 includes DSRABSQL and five comparison algorithms, obtained by removing one of the five strategies from DSRABSQL and denoted DSR-1, …, DSR-4, DSR-5. DSR-1–DSR-4 retain the improvement outside the QL framework, while DSR-5 retains only the improvements within the QL framework. The relationships between the algorithms and the five strategies are shown in Table 6.
  2. For every algorithm in each group, instances ch150, rat575, and fl1400 are tested 50, 20, and 10 times, respectively. The experimental results for ch150, rat575, and fl1400 are illustrated by the boxplots of Figs 12–14, respectively.
  3. For each comparison algorithm, the effects of the strategies on algorithm improvement are judged from the position distributions of the boxes in these figures. The values of min, avg, and max in the boxplots are converted to Sc for processing, with results shown in Table 7.
Fig 12. Boxplots obtained by testing ch150 for 50 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g012

Fig 13. Boxplots obtained by testing rat575 for 20 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g013

Fig 14. Boxplots obtained by testing fl1400 for 10 times by each algorithm.

https://doi.org/10.1371/journal.pone.0283207.g014

Table 6. Table of relationships between algorithms and five strategies for two experimental groups.

https://doi.org/10.1371/journal.pone.0283207.t006

Table 7. Sc of ch150, rat575 and fl1400 for algorithms of each group.

https://doi.org/10.1371/journal.pone.0283207.t007

According to Figs 12–14, two kinds of comparisons are performed, as follows:

(1) Horizontal comparison

For ch150 in experimental Group 1, the improvement over QL increases sequentially across QL+2, QL+4, QL+1, QL+3, and QL+5. In experimental Group 2, the performance of DSR-2, DSR-4, DSR-1, DSR-3, and DSR-5 deteriorates progressively relative to DSRABSQL.

For rat575 and fl1400 in experimental Group 1, the improvement degree of QL+2, QL+4, QL+1, QL+3, and QL+5 increases sequentially compared with QL. In experimental Group 2, the performance of DSR-2, DSR-4, DSR-1, DSR-5, and DSR-3 deteriorates gradually compared with DSRABSQL.

(2) Vertical comparison

In experimental Group 1, for ch150, rat575, and fl1400, the improvement degree of QL+2, QL+4, QL+1, QL+3, and QL+5 increases sequentially compared with QL, and with expansion of the instance scale, the solution quality of QL+3 is constantly approaching that of QL+5.

In experimental Group 2, for ch150, the performance of DSR-2, DSR-4, DSR-1, DSR-3, and DSR-5 deteriorates gradually compared with DSRABSQL; for rat575 and fl1400, the performance of DSR-2, DSR-4, DSR-1, DSR-5, and DSR-3 deteriorates gradually compared with DSRABSQL. It can also be found that, as the instance scale increases, the gap between DSR-3 and the other four algorithms grows gradually, while the solution quality of DSR-1 and DSR-4 steadily approaches that of DSR-5.

The above analysis shows that for small-scale instances, stgy_5 makes the greatest contribution to both QL and DSRABSQL, indicating its key role in both the early and later improvement stages of the algorithm. For medium- and large-scale instances, stgy_5 contributes most to the QL algorithm and stgy_3 contributes most to DSRABSQL, which indicates that stgy_5 plays a key role in the early improvement stage and stgy_3 in the later improvement stage. In conclusion, stgy_5 has the greatest potential for small-scale instances, and stgy_3 for medium- and large-scale instances. stgy_1 and stgy_4 also have some potential for large-scale instances, while stgy_2 has no obvious influence at any instance scale.

In addition, from the differences in Sc between the algorithms in Table 7, it can be found that collaborations exist between the strategies. To describe this phenomenon in detail, the following data are calculated based on Table 7. Since stgy_5 is the only one of the five strategies outside the QL framework, and it achieves the best improvement without the collaboration of other strategies, we use it as the reference strategy. The difference in Sc between QL+5 and QL+k (k ∈ {1, 2, 3, 4}) is denoted d5-k and calculated according to Formula (18), and the difference in Sc between DSR-5 and DSR-k is denoted d̂5-k and calculated according to Formula (19), where ScQL+k, ScQL+5, ScDSR-k, and ScDSR-5 represent the Sc values of the corresponding instances for QL+k, QL+5, DSR-k, and DSR-5, respectively. For algorithm QL+k, no collaboration exists between stgy_1–stgy_5, but collaboration does exist for DSR-k. These algorithms are divided into four classes according to their adopted strategies, and the relevant data are listed in Table 8.

Table 8. The difference of Sc between stgy_5 and stgy_1–stgy_4 with and without collaboration.

https://doi.org/10.1371/journal.pone.0283207.t008

From Table 8, the following can be observed. For ch150 in class I, the Sc of QL combined with stgy_1 (QL+1) is 2.14% larger than that of QL combined with stgy_5 (QL+5); however, the Sc of QL combined with stgy_1–stgy_4 (DSR-5) is only 1.42% larger than that of QL combined with stgy_2–stgy_5 (DSR-1), which indicates that, with the collaboration of stgy_2, stgy_3, and stgy_4, the difference in improvement effect between stgy_1 and stgy_5 gradually decreases. The same holds for classes II–IV, and similar results can be deduced for rat575 and fl1400.

In other words, without collaboration, the improvement effects of stgy_1–stgy_4 are all worse than that of stgy_5. Under collaboration, however, the difference in improvement effect between stgy_1–stgy_4 and stgy_5 gradually decreases, and the improvement effects of stgy_1 and stgy_3 are sometimes even better than that of stgy_5. This indicates a strong collaboration effect among stgy_1–stgy_4, while stgy_5 is relatively independent.

In class I, for ch150 the Sc of DSR-5 is 1.42% larger than that of DSR-1; for rat575 it is 0.75% larger; and for fl1400 it is 0.2% smaller. This indicates that, with the expansion of the instance scale and the collaboration of stgy_2, stgy_3, and stgy_4, the difference in improvement effect between stgy_1 and stgy_5 decreases until stgy_1 overtakes stgy_5. The same is found for classes II–IV. Therefore, as the instance scale grows, the collaboration effect among stgy_1–stgy_4 increasingly approaches, and eventually exceeds, the improvement effect of stgy_5.

The above analysis shows that DSRABSQL has clear advantages in solving the TSP. To further verify the effectiveness of the algorithm, we select S2V-DQN and the 2-opt heuristic combined with deep reinforcement learning (2-opt-DRL for short) from categories 1 and 2 described in Section 1 to compare with DSRABSQL. Table 9 lists the min and the average of Smin obtained by DSRABSQL and these two algorithms on 25 instances.

Table 9. The min obtained by DSRABSQL, S2V-DQN and 2-opt-DRL algorithms for 25 instances.

https://doi.org/10.1371/journal.pone.0283207.t009

For these 25 instances, DSRABSQL obtains better results than S2V-DQN on 23 instances and better results than 2-opt-DRL on 19 instances. Overall, the average Smin of DSRABSQL is 0.33 times that of S2V-DQN and 0.18 times that of 2-opt-DRL, which shows that the solutions of DSRABSQL retain a clear advantage over those of the RL-based methods.

5. Conclusion and future work

We proposed a dynamic sub-route-based self-adaptive beam search Q-learning algorithm for medium- to large-scale TSPs. On the basis of the QL algorithm, a weighting function-based reward matrix, a power function-based initial Q-table, a self-adaptive ε-beam search strategy, a new Q-value update formula, and a dynamic sub-route optimization strategy were introduced. The first four are improvement strategies within the QL framework, and the last is outside of it.

Compared with the comparative algorithms, DSRABSQL not only has certain advantages in solution time, solution quality, and stability, but also exhibits strong local optimization and good learning abilities. However, it still falls short of the current best heuristic algorithms, such as LKH and DEACO. Therefore, further improvement of DSRABSQL is necessary.

For the five proposed improvement strategies of DSRABSQL, the experimental results show that they accelerate the convergence of the algorithm, enhance its search and local optimization abilities, and balance exploration and exploitation. In addition, there are strong collaboration effects among the four strategies within the QL framework, which grow stronger as the instance scale expands. Although the dynamic sub-route optimization strategy has the most obvious improvement effect on small-scale instances, its relative independence leads to a fixed improvement effect, which becomes limiting as the problem size increases. The self-adaptive ε-beam search strategy not only makes the greatest contribution on medium- and large-scale problems, but also has strong collaboration effects with the other strategies within the QL framework, and its advantage grows with the instance size. Therefore, further improvement of the self-adaptive ε-beam search strategy will be studied.

DSRABSQL still faces challenges in solving larger-scale problems. Moreover, its many manually set parameters increase the complexity of parameter tuning and may result in poor solutions for certain types of problems.

In future work, we hope to improve the self-adaptive ε-beam search strategy, enhance the collaboration effects between strategies, and adjust the parameters of each strategy to find a robust QL framework to improve the solution quality of large-scale problems and obtain a model with a better solution capability.

References

  1. Lawler EL. The traveling salesman problem: a guided tour of combinatorial optimization. Wiley-Interscience Series in Discrete Mathematics. 1986; 18(5): 514–515. https://doi.org/10.1112/blms/18.5.514
  2. Pekny JF, Miller DL. An Exact Parallel Algorithm for the Resource Constrained Traveling Salesman Problem with Application to Scheduling with an Aggregate Deadline. In: ACM Conference on Computer Science; 1990. p. 208–214. https://doi.org/10.1145/100348.100380
  3. Pesant G, Gendreau M, Rousseau PM. An Exact Constraint Logic Programming Algorithm for the Traveling Salesman Problem with Time Windows. Transportation Science. 1998; 32(1): 12–29. https://doi.org/10.1287/trsc.32.1.12
  4. Sanches D, Whitley D, Tinos R. Improving an exact solver for the traveling salesman problem using partition crossover. In: Genetic and Evolutionary Computation Conference; 2017. p. 337–344. https://doi.org/10.1145/3071178.3071304
  5. Helsgaun K. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research. 2000; 126(1): 106–130. https://doi.org/10.1016/S0377-2217(99)00284-2
  6. Ebadinezhad S. DEACO: Adopting dynamic evaporation strategy to enhance ACO algorithm for the traveling salesman problem. Engineering Applications of Artificial Intelligence. 2020; 92: 103649. https://doi.org/10.1016/j.engappai.2020.103649
  7. Wang Z, Shen Y, Li S, Wang S. A fine-grained fast parallel genetic algorithm based on a ternary optical computer for solving traveling salesman problem. The Journal of Supercomputing. 2022: 1–31. https://doi.org/10.1007/s11227-022-04813-9
  8. Zheng R, Zhang Y, Yang K. A transfer learning-based particle swarm optimization algorithm for travelling salesman problem. Journal of Computational Design and Engineering. 2022; 9(3): 933–948. https://doi.org/10.1093/jcde/qwac039
  9. Zhang J, Hong L, Liu Q. An Improved Whale Optimization Algorithm for the Traveling Salesman Problem. Symmetry. 2020; 13(1): 48–60. https://doi.org/10.3390/sym13010048
  10. Panwar K, Deep K. Discrete Grey Wolf Optimizer for symmetric travelling salesman problem. Applied Soft Computing. 2021; 105: 107298. https://doi.org/10.1016/j.asoc.2021.107298
  11. Vinyals O, Fortunato M, Jaitly N. Pointer networks. In: Advances in Neural Information Processing Systems, vol. 28; 2015.
  12. Bello I, Pham H, Le QV, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. In: 5th International Conference on Learning Representations; 2019. https://doi.org/10.48550/arXiv.1611.09940
  13. Khalil E, Dai H, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. In: Advances in Neural Information Processing Systems, vol. 30; 2017. p. 6348–6358.
  14. Deudon M, Cournut P, Lacoste A, Adulyasak Y, Rousseau LM. Learning heuristics for the TSP by policy gradient. In: International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, vol. 10848; 2018. p. 170–181. https://doi.org/10.1007/978-3-319-93031-2_12
  15. Ma Q, Ge S, He D, Thaker D, Drori I. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: Association for the Advancement of Artificial Intelligence; 2019. https://doi.org/10.48550/arXiv.1911.04936
  16. Stohy A, Abdelhakam HT, Ali S, Elhenawy M, Hassan AA, Masoud M, et al. Hybrid pointer networks for traveling salesman problems optimization. PLoS One. 2021; 16(12): e0260995. pmid:34905571
  17. Liu F, Zeng G. Study of genetic algorithm with reinforcement learning to solve the TSP. Expert Systems with Applications. 2009; 36(3): 6995–7001. https://doi.org/10.1016/j.eswa.2008.08.026
  18. Alipour MM, Razavi SN, Derakhshi MRF, Balafar MA. A hybrid algorithm using a genetic algorithm and multiagent reinforcement learning heuristic to solve the traveling salesman problem. Neural Computing and Applications. 2018; 30(9): 2935–2951. https://doi.org/10.1007/s00521-017-2880-4
  19. Costa PRdO, Rhuggenaath J, Zhang Y, Akcay A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian Conference on Machine Learning, vol. 129; 2020. p. 465–480.
  20. Zheng J, He K, Zhou J, Li CM. Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35; 2021. p. 12445–12452. https://doi.org/10.1609/aaai.v35i14.17476
  21. Chaves AA, Lorena L. An Adaptive and Near Parameter-Free BRKGA Using Q-Learning Method. In: 2021 IEEE Congress on Evolutionary Computation; 2021. p. 2331–2338. https://doi.org/10.1109/CEC45853.2021.9504766
  22. Ottoni ALC, Nepomuceno EG, de Oliveira MS. A response surface model approach to parameter estimation of reinforcement learning for the travelling salesman problem. Journal of Control, Automation and Electrical Systems. 2018; 29(3): 350–359. https://doi.org/10.1007/s40313-018-0374-y
  23. Beom HR, Cho HS. A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics. 1995; 25(3): 464–477. https://doi.org/10.1109/21.364859
  24. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016; 529(7587): 484–489. pmid:26819042
  25. Moravčík M, Schmid M, Burch N, Lisý V, Morrill D, Bard N, et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science. 2017; 356(6337): 508–513. pmid:28254783
  26. Ontanón S, Synnaeve G, Uriarte A, Richoux F, Churchill D, Preuss M. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games. 2013; 5(4): 293–311. https://doi.org/10.1109/TCIAIG.2013.2286295
  27. Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013; 32(11): 1238–1274. https://doi.org/10.1177/0278364913495721
  28. Ng AY, Kim HJ, Jordan MI, Sastry S. Autonomous helicopter flight via reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 16; 2003.
  29. Wei H, Zheng GJ, Yao HX, Li ZH. IntelliLight: A reinforcement learning approach for intelligent traffic light control. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 2496–2505. https://doi.org/10.1145/3219819.3220096
  30. Wang D, Qi F. Trajectory planning for a four-wheel-steering vehicle. In: Proceedings 2001 International Conference on Robotics and Automation, vol. 4; 2001. p. 3320–3325. https://doi.org/10.1109/ROBOT.2001.933130
  31. Li MN, Qin ZW, Jiao Y, Yang YD, Wang J, Wang CX, et al. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In: The World Wide Web Conference; 2019. p. 983–994. https://doi.org/10.1145/3308558.3313433
  32. Akalin N, Loutfi A. Reinforcement learning approaches in social robotics. Sensors. 2021; 21(4): 1292–1328. pmid:33670257
  33. Rummery GA, Niranjan M. On-line Q-learning using connectionist systems. Cambridge: Cambridge University Press; 1994.
  34. Watkins CJCH, Dayan P. Q-learning. Machine Learning. 1992; 8(3–4): 279–292. https://doi.org/10.1007/BF00992698
  35. Zamfirache IA, Precup RE, Roman RC, Petriu EM. Reinforcement Learning-based Control Using Q-learning and Gravitational Search Algorithm with Experimental Validation on a Nonlinear Servo System. Information Sciences. 2021; 583: 99–120. https://doi.org/10.1016/j.ins.2021.10.070
  36. Huynh TN, Do DTT, Lee J. Q-Learning-based parameter control in differential evolution for structural optimization. Applied Soft Computing. 2021; 107: 107464–107477. https://doi.org/10.1016/j.asoc.2021.107464
  37. Liu Q, Zhai JW, Zhang ZZ, Zhong S, Zhou Q, Zhang P. A survey on deep reinforcement learning. Chinese Journal of Computers. 2018; 41(1): 1–27.
  38. Mao GJ, Gu SM. An Improved Q-Learning Algorithm and Its Application in Path Planning. J. Taiyuan Univ. Technol. 2021; 52(1): 91–97.
  39. Lowerre BT. The Harpy speech recognition system. Carnegie Mellon University; 1976.
  40. Zhang JF, Sun YT, Li GC, Wang YH, Sun JB, Li JX. Machine-learning-assisted shear strength prediction of reinforced concrete beams with and without stirrups. Engineering with Computers. 2020; 5: 1–15. https://doi.org/10.1007/s00366-020-01076-x
  41. Kai-Wen L, Tao Z, Rui W, et al. Research reviews of combinatorial optimization methods based on deep reinforcement learning. Acta Automatica Sinica. 2021; 47(11): 2521–2537.