Hybrid Pointer Networks for Travelling Salesman Problem Optimization

In this work, we propose a hybrid pointer network (HPN), an end-to-end deep reinforcement learning architecture for tackling the travelling salesman problem (TSP). HPN builds upon graph pointer networks, an extension of pointer networks with an additional graph embedding layer. HPN combines the graph embedding layer with the transformer's encoder to produce multiple embeddings for the feature context. We conducted extensive experimental work to compare HPN and the graph pointer network (GPN). For the sake of fairness, we used the same settings as proposed in the GPN paper. The experimental results show that our network significantly outperforms the original graph pointer network for small and large-scale problems. For example, it reduced the cost for travelling salesman problems with 50 cities/nodes (TSP50) from 5.959 to 5.706 without utilizing 2-opt. Moreover, we solved benchmark instances of variable sizes using HPN and GPN. The cost of the solutions and the testing times are compared using linear mixed effects models. We found that our model yields statistically significantly better solutions in terms of the total trip cost. We make our data, models, and code publicly available at https://github.com/AhmedStohy/Hybrid-Pointer-Networks.


I. INTRODUCTION
Due to its many applications in many areas, the travelling salesman problem (TSP) has received significant attention from the machine learning community in recent years. However, the developed neural combinatorial optimization models are still in their infancy. Generalization is still an unresolved problem when it comes to dealing with a large number of points with high precision. The TSP is considered one of the most significant and practical problems. Consider a salesman travelling to several areas: the salesman must visit each city just once while minimizing the total travel time. TSP is an NP-hard problem [2], for which no algorithm is known that finds an optimal solution in polynomial time. Many approximation algorithms and heuristics, such as the Christofides algorithm [3], local search [4], and the Lin-Kernighan heuristic (LKH) [5], have been developed to overcome the complexity of exact algorithms, which are guaranteed to yield an optimal solution but are frequently too computationally costly to be utilized in practice [6]. The pointer network [7], a seq2seq model [8], shows great potential for approximate solutions to combinatorial optimization problems such as identifying the convex hull and the TSP. It uses an LSTM [9] as the encoder and an attention mechanism [10] as the decoder to extract features from city coordinates. It then predicts a policy outlining the next likely city by selecting a permutation of visited cities. The pointer network model is trained using the actor-critic technique [11].
Moreover, the attention models of [12] and [13], influenced by the Transformer architecture [10], tried to address routing problems such as the TSP and the VRP. Graph pointer networks extended the traditional pointer network with an additional graph embedding layer; this transformation achieved better generalization for large-scale problems, but the GPN model without 2-opt still struggles to find optimal solutions at both small and large scales. The work proposed in this paper begins with how the performance of graph pointer networks can be improved without changing much of the architecture: an extra encoder layer is added alongside the graph embedding layer to act as a hybrid encoder, which gives the model the ability to achieve good results; this will be discussed in greater detail in the HPN section. Extensive results show that the proposed technique significantly outperforms previous DL-based methods on TSP. The learnt model is more successful than typical handcrafted rules in guiding the improvement process, and it may be further strengthened by simple ensemble methods. Furthermore, HPN generalizes rather well to a variety of problem sizes, starting solutions, and even real-world datasets. It should be noted that the goal is not to outperform highly optimized and specialized traditional solvers, but to present a generalized model that can automatically learn good search heuristics on different problem sizes, which has great value when applied to real-world problems.

II. Travelling salesman problem (TSP)
TSP is a classic example of a combinatorial optimization problem and has been used in data clustering, genome sequencing, and other fields. The TSP is NP-hard, and several exact, heuristic, and approximation algorithms have been developed to solve it. In this paper, TSP problems are assumed to be symmetric. The symmetric TSP is regarded as an undirected graph.

Asymmetric vs. symmetric TSP
In the symmetric TSP, the distance between two cities is the same in each opposite direction, producing an undirected graph. This symmetry cuts the number of alternative solutions in half. In the asymmetric TSP, paths may not exist in both directions, or the distances may differ, resulting in a directed graph.

Directed vs. undirected graphs
Edges in undirected graphs do not have a direction. Each edge may be travelled in both directions, indicating a two-way connection. The edges of directed graphs have a direction; each edge may only be travelled in one direction, representing a one-way connection. A complete undirected graph can be defined as $C = (V, E)$, where $V$ is the set of vertices of the graph $C$ and $E$ is the set of edges between these vertices. In this study, the TSP graph is complete, so every vertex has an edge to each of the other vertices in the graph. The graph is symmetric if $e_{ij} = e_{ji}$ for every pair of vertices $i, j$. In the context of this paper, $e_{ij}$ equals the distance between the vertices $i$ and $j$. Given a set $V$ of $n$ cities in a two-dimensional space, the objective is to find the optimal Hamiltonian cycle that minimizes the total tour length [14]:

$$L(\sigma) = \sum_{i=1}^{n} \lVert x_{\sigma(i)} - x_{\sigma(i+1)} \rVert_2, \qquad \sigma(n+1) = \sigma(1),$$

where $\lVert \cdot \rVert_2$ is the $\ell_2$ norm and $\sigma$ denotes a tour.
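The tour-length objective above is straightforward to compute for a given permutation of cities; the following is a minimal NumPy sketch (not taken from the paper's code):

```python
import numpy as np

def tour_length(coords, tour):
    """Total length of a closed Hamiltonian tour.

    coords: (n, 2) array of city coordinates.
    tour:   permutation of 0..n-1 giving the visiting order.
    """
    ordered = coords[tour]
    # Distance from each city to the next, wrapping back to the start.
    diffs = ordered - np.roll(ordered, -1, axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())

# Unit square: visiting the corners in order gives a perimeter of 4.
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
print(tour_length(square, [0, 1, 2, 3]))  # 4.0
```

This is the quantity every model in the paper is trained to minimize and the metric used in all the comparisons that follow.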

III. REINFORCEMENT LEARNING (RL)
Reinforcement learning (RL) is the process of learning what to do in order to maximize a numerical reward signal. The agent is not instructed which actions to take but must experiment to determine which actions yield the greatest expected reward. To begin, we define the notation used to represent the TSP as a reinforcement learning problem.
Let $S$ be the state space and $A$ the action space. Each state $s_t \in S$ is defined as the set of all previously visited cities.
The action $a_t \in A$ is defined as the next selected city from the set of possible cities. Our model is a sequential one that, given the currently selected city, outputs a probability distribution over the next candidates from the remaining cities that have not yet been chosen. We can define our policy as

$$p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1}),$$

from which we can sample to obtain a tour $\pi$. In order to train our model, we define the loss [12]

$$\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[L(\pi)\right],$$

where $L(\pi)$ is the cost of the tour that we are attempting to minimize. Recall the REINFORCE [15] gradient with baseline, which is an extension of the policy gradient algorithm [16]:

$$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\left(L(\pi) - b(s)\right)\nabla \log p_\theta(\pi \mid s)\right],$$

where $b(s)$ is the baseline subtracted from the cost to reduce the variance of the policy gradient. The optimal baseline is one that lowers variance as much as possible while simultaneously speeding up the training process. As a result, we employ the rollout baseline given by [12].

IV. Hybrid Pointer Network (HPN)
HPN is inspired by the graph pointer network (GPN), which is itself a modified variant of the classic pointer network (PN).
Graph pointer networks have been used to tackle the TSP. Building on this approach, in this paper:
• The graph embedding layer is combined with the transformer's encoder to produce multiple embeddings for the feature context.
• An extra decoder layer is added to operate as a multi-decoder network to improve the agent's decision-making process throughout the learning phase.
• Finally, we switch our learning algorithm from a central self-critic [17] to an actor-critic one, as suggested by Kool [12].
In the GPN, an additional graph embedding layer is added above the pointer network, allowing the model to capture the complicated relationships between graph nodes in large-scale problems. However, it still struggles to find a globally optimal strategy for small and large TSP problems. This study proposes extending the network architecture to converge to a better policy for small, medium, and large sizes. The proposed HPN is shown in Figure 1. HPN consists of a mixture of several encoder architectures and a multi-decoder based on the attention concept.

Hybrid Encoder
As illustrated in Figure 2, the proposed encoder consists of two parts: the hybrid context encoder, which encodes the feature vector into two contextual vectors, and the point encoder, which encodes the currently selected city with an LSTM. Two different encoders are employed for the hybrid context encoder. The first encoder is a typical transformer encoder with multi-head attention and residual connections with a batch normalization layer. The transformer encoder's attention with a single head is [18]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad Q = HW^{Q},\; K = HW^{K},\; V = HW^{V},$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable parameters, $H$ is a matrix containing the encoded nodes, and $Q$, $K$, and $V$ are the query, key, and value of the self-attention.
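The single-head attention above can be sketched in a few lines of NumPy; this is a bare-bones illustration (batching, multi-head projections, residual connections, and normalization layers are omitted), with random matrices standing in for the learned parameters:

```python
import numpy as np

def single_head_attention(H, Wq, Wk, Wv):
    """Single-head self-attention over node embeddings H of shape (n, d).

    Q = H Wq, K = H Wk, V = H Wv
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8                         # 5 cities, embedding dimension 8
H = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = single_head_attention(H, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Each output row is a weighted mixture of all node values, which is what lets the encoder relate every city to every other city in one layer.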
The second is the graph embedding layer. The graph embedding layer context is acquired by directly encoding the context vector obtained from the coordinates of the cities. Because we are only considering the symmetric TSP, the graph is complete. As a result, the graph embedding layer can be written as [1]:

$$X^{l} = \gamma X^{l-1}\Theta + (1 - \gamma)\,\phi_\theta\!\left(\frac{X^{l-1}}{|N(i)|}\right),$$

where $X^{l}$ is the embedding at layer $l$, $\gamma$ is a trainable parameter, $\Theta$ is a learnable weight matrix, $\phi_\theta$ is the aggregation function, and $N(i)$ is the set of neighbours of node $i$. The hidden state $x^{h}$ of the LSTM is passed to both the decoder of the current time stamp and the encoder of the next time stamp.

Multi-decoder
To begin the decoding phase, a placeholder is added at the first iteration of decoding to select the best location to start the tour. The decoder is based on the attention mechanism of a pointer network and outputs the pointer vector $u_i$, which is then sent through a softmax layer to build a distribution across the following candidate cities. The attention mechanism and the pointer vector $u_i$ are defined as follows [1]:

$$u_i^{(j)} = \begin{cases} v^{T}\tanh\!\left(W_{r} r_{j} + W_{q} q\right) & \text{if } j \neq \sigma(k),\ \forall k < i,\\ -\infty & \text{otherwise}, \end{cases}$$

where $u_i^{(j)}$ is the $j$-th entry of the vector $u_i$, $W_r$ and $W_q$ are trainable parameters, $q$ is the query vector from the hidden state of the LSTM, and $r_j$ is a reference vector containing the contextual information for city $j$.
The encoded context from the transformer's encoder is used as a reference for the first decoder layer and the context obtained from the graph embedding layer is used as the reference for the second decoder layer, as illustrated in Figure 1.
For determining the distribution policy across the candidate cities, four different operations can be employed for the aggregator:
• The first option is to add the two attention vectors from each decoder layer: $p(a_t \mid s_t) = \mathrm{softmax}(u^{1} + u^{2})$.
• The second option is to take the maximum value between these two vectors: $p(a_t \mid s_t) = \mathrm{softmax}(\max(u^{1}, u^{2}))$.
• The third option is to take the mean: $p(a_t \mid s_t) = \mathrm{softmax}\!\left((u^{1} + u^{2})/2\right)$.
• The final option is to concatenate both of them and feed the concatenated vector into a single embedding layer, letting the model decide how to aggregate them.
In the results section, we display the outcome for each of them.
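The masked pointer attention and the first three aggregator options can be sketched as follows; this is an illustrative NumPy version with random stand-ins for the trained weights, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.nanmax(x))
    return z / z.sum()

def pointer_logits(q, ref, Wq, Wr, v, visited):
    """Pointer-network attention u_j = v^T tanh(Wr r_j + Wq q),
    with already-visited cities masked to -inf so they get zero probability."""
    u = np.tanh(ref @ Wr.T + q @ Wq.T) @ v
    u[visited] = -np.inf
    return u

def aggregate(u1, u2, how="sum"):
    """Combine the two decoder layers' pointer vectors into one policy."""
    if how == "sum":
        return softmax(u1 + u2)
    if how == "max":
        return softmax(np.maximum(u1, u2))
    if how == "mean":
        return softmax((u1 + u2) / 2.0)
    raise ValueError(how)

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.standard_normal(d)            # query from the LSTM hidden state
ref = rng.standard_normal((n, d))     # reference vectors, one per city
Wq, Wr = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v = rng.standard_normal(d)
u1 = pointer_logits(q, ref, Wq, Wr, v, visited=[0, 2])
u2 = pointer_logits(q, ref, Wq, Wr, v, visited=[0, 2])
p = aggregate(u1, u2, how="sum")
print(p[0], p[2])  # 0.0 0.0 -> masked cities get zero probability
```

In the real model, `u1` and `u2` come from the two decoder layers attending to the transformer context and the graph embedding context respectively; here they share inputs only to keep the sketch short.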

V. EXPERIMENTS
In our experiments, the training data is generated randomly from a $[0, 1]^2$ uniform distribution.

Small-scale experiments
We begin our experiments with a difficult question: which aggregator function will assist our model in achieving better results? Indeed, it is difficult to answer this question without experimenting with all of them; we examined the above-mentioned options for this component and recorded the training performance results. Figure 3 illustrates these results.

Figure 3. Training performance for the actor (top) and the critic (bottom), with the total tour length on the y-axis and the number of epochs on the x-axis, indicating that when we apply the sum operation between the two attention vectors, the model converges a little faster compared with the others.
We can conclude from the above figure that the summation has excellent performance at first, but by the middle of training the average has caught up with it, while the maximum and the single-layer aggregation yield slightly worse results, so we decided to stop examining them.
To tackle the small scale, we use TSP50 instances to train our HPN model. TSP50's average training time for each epoch is 19 minutes while utilizing one instance of an NVIDIA Tesla P100 GPU. We compare the performance of our model on small-scale TSP to earlier studies such as graph pointer networks, the attention model, the pointer network, s2v-DQN [19], the Transformer network [18], and other heuristics, e.g. the 2-opt heuristic, the Christofides algorithm, and random insertion.
The results are shown in Figure 4, which compares the approximate tour length to the optimal solution. A smaller number indicates a better result.

Figure 4. Comparison of TSP50 results.
As demonstrated in Figure 4, our HPN model surpasses the current models and achieves the state-of-the-art solution for TSP50. Our model outperforms the graph pointer network by a wide margin, reducing the tour length for TSP50 from 5.959 to 5.706 without utilizing 2-opt, which is a strong result for the hybridization concept.

Large-scale experiments
To achieve the best possible generalization from our model, instead of just using the cities' coordinates as a context for both the graph encoder and the transformer encoder, we replace them with a feature context that accelerates training convergence for large-scale problems. The feature context consists of the vector context previously used by [1] concatenated with the Euclidean distance, where the vector context is simply the difference between the coordinates of the currently selected city and those of the other cities.
Our feature extractor component performs this job, as illustrated in Figure 5. It is essential in the proposed HPN model since it extracts the most relevant information and feeds it to the encoder as a context. Suppose that $X_i$ is a matrix with $N$ identical rows, each equal to the coordinates of the currently selected city $i$. We define $\bar{X}_i = X - X_i$ as the vector context, where $X$ is the matrix that contains the coordinates of all cities; the $j$-th row of $\bar{X}_i$ is a vector pointing from node $i$ to node $j$. We extended this notion by concatenating the Euclidean distance between the currently selected city and the other cities to the vector context. Using the feature extractor, we train our large model on TSP50 and validate on TSP500, with 10 epochs, a 1e-3 learning rate with learning rate decay 0.96, and 100 for tanh clipping. Some sample tours are shown in Figure 6, in which we solve TSP50-250-500-100 with HPN+2opt.
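The feature context described above (vector context plus Euclidean distances) can be built in a few lines; the following is a minimal sketch under the definitions in this section, not the paper's code:

```python
import numpy as np

def feature_context(coords, current):
    """Feature context for city `current`: the vector context
    (coordinate differences pointing from the current city to every city)
    concatenated with the Euclidean distance to every city."""
    vec = coords - coords[current]                      # (n, 2) vector context
    dist = np.linalg.norm(vec, axis=1, keepdims=True)   # (n, 1) distances
    return np.concatenate([vec, dist], axis=1)          # (n, 3) feature vector

coords = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
ctx = feature_context(coords, current=0)
print(ctx[1])  # [3. 4. 5.]
```

The row for the current city itself is all zeros, and each other row encodes both the direction and the length of the edge to that city, which is exactly the relative information the encoder needs.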

Table 2 summarizes our results, showing that our model generalizes better than the GPN. For the sake of a fair comparison with the state of the art (i.e., GPN), we used the 2-opt local search technique to fine-tune HPN's tours for larger instance sizes. As shown in Table 2 and Figure 7, our models outperform the GPN, GPN+2opt, PN, AM, and 2-opt models. Moreover, HPN+2opt returns near-optimal tours and generalizes better than the GPN on large-scale instances.

Table 2. TSP results using the hybrid pointer network (HPN) vs. baselines. Each result is obtained by averaging over 10k random TSP instances for TSP50 and 1k random instances for larger sizes.
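The 2-opt local search used above to fine-tune the constructed tours repeatedly reverses tour segments while doing so shortens the tour; a naive illustrative implementation (not the paper's, and far slower than optimized variants) looks like this:

```python
import numpy as np

def tour_len(coords, tour):
    d = coords[tour] - coords[np.roll(tour, -1)]
    return float(np.linalg.norm(d, axis=1).sum())

def two_opt(coords, tour):
    """Reverse tour segments while any reversal strictly shortens the tour."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                # Candidate tour with the segment tour[i..j] reversed.
                cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_len(coords, cand) < tour_len(coords, tour) - 1e-12:
                    tour, improved = cand, True
    return tour

rng = np.random.default_rng(0)
coords = rng.random((20, 2))
start = list(range(20))           # an arbitrary starting tour
refined = two_opt(coords, start)
print(tour_len(coords, refined) <= tour_len(coords, start))  # True
```

Because 2-opt only ever accepts improvements, a shorter starting tour from HPN lets it reach a good local optimum in fewer passes, which is consistent with the lower HPN+2opt testing times reported later.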

Benchmark instances results and statistical analysis
To validate our model against standard benchmark instances, we employed varied-size examples from the public libraries TSPLIB and World TSP. The benchmark dataset consists of 34 instances. The naming convention of an instance consists of the first few letters of the instance location and the problem size n. For example, the instance eg7146 has 7146 points in Egypt. The instance sizes vary from 400 to 10639 nodes (cities). The normalized and actual tour lengths in km and the testing times in seconds are reported in Table 3.
To understand how HPN performs compared to the GPN (the state-of-the-art network) in terms of the tour cost and testing time shown in Table 3, we performed a statistical comparison between these two networks. The statistical model should account for the dependency between the observations shown in Table 3.
In other words, we should recognize that the tour costs of HPN and GPN for the same instance are correlated. Moreover, the testing times of the same instance using the two networks are correlated as well. Therefore, we used a generalized linear mixed effects model to explain the variability of the tour cost and testing time as a function of the network used and the size of the instance. We used one indicator variable to code the GPN and HPN.
In Table 4, we compare the tour costs of the HPN and GPN. The p-value of the GPN indicator variable is <.0001, and we conclude that the tour cost of the GPN is statistically significantly higher than the tour cost of the HPN. However, as shown in Table 5, the testing time of the HPN is statistically significantly higher than the testing time of the GPN. We repeat the same analysis for the tour cost and the testing time of HPN+2opt and GPN+2opt. As shown in Table 6 and Table 7, HPN+2opt has a statistically significantly lower testing time and tour cost. This is expected because HPN returns a better tour as an initial point in the solution space, which helps 2-opt find a better final tour in less time.
Finally, for the sake of completeness, we visualize four constructed tours using HPN+2opt in Figure 8.

VI. CONCLUSION AND FUTURE WORK
In this work, a hybrid pointer network (HPN) is proposed for tackling both small-scale and large-scale problems. We demonstrate that the hybrid concept with a graph-based method is successful in improving model performance at both scales. We used REINFORCE with a rollout baseline to train our model. Our results show that our model outperforms the traditional graph pointer network by a significant margin, resulting in improved model generalization. In the future, we will attempt to find a robust architecture to improve the quality of solutions and the solution time for large-scale problems, resulting in better model generalization. We also want to tackle combinatorial problems with constraints, which will be an important direction for future study.

Algorithm 1. REINFORCE with Rollout Baseline [12]
1: input: number of epochs E, steps per epoch T, batch size B

Figure 1. Architecture of HPN, combining a hybrid context encoder with a multi-attention decoder.

Figure 2. Hybrid encoder consisting of a transformer encoder and a graph embedding layer as the hybrid context encoder (blue dotted box) and the point encoder (red dotted box) for the current city.

In each epoch, the training data is generated on the fly. The hyperparameters provided in Table 1 are used in the following experiments.

Figure 5. Architecture of the feature extractor, which combines the vector context with the Euclidean distance and outputs a feature vector. The Euclidean distance between two points $(x_i, y_i)$ and $(x_j, y_j)$ is shown in (0.10):

$$d = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \qquad (0.10)$$

where $x$ and $y$ are the coordinates of each city.

Figure 7. Large-scale results for GPN, HPN, GPN+2opt, and HPN+2opt, demonstrating that the gap between our model and the GPN increases as the problem size increases.

Figure 8. Sample tours for benchmark instances using HPN+2opt.

Table 1. Hyperparameters used in the following experiments.