Abstract
Leveraging the Transformer architecture to develop end-to-end models for addressing combinatorial optimization problems (COPs) has shown significant potential due to its exceptional performance. Nevertheless, a multitude of COPs, including the Traveling Salesman Problem (TSP), display typical graph structure characteristics that existing Transformer-based models have not effectively utilized. Hence, this study focuses on TSP and introduces two enhancements, namely closeness centrality encoding and spatial encoding, to strengthen the Transformer encoder’s capacity to capture the structural features of TSP graphs. Furthermore, by integrating a decoding mechanism that not only emphasizes the starting and most recently visited nodes, but also leverages all previously visited nodes to capture the dynamic evolution of tour generation, a Transformer-based structure-aware model is developed for solving TSP. Employing deep reinforcement learning for training, the proposed model achieves deviation rates of 0.03%, 0.16%, and 1.13% for 20-node, 50-node, and 100-node TSPs, respectively, in comparison with the Concorde solver. It consistently surpasses classic heuristics, OR-Tools, and various comparative learning-based approaches in multiple scenarios while showcasing a remarkable balance between time efficiency and solution quality. Extensive tests validate the effectiveness of the improvement mechanisms, underscore the significant impact of graph structure information on solving TSP using deep neural networks, and also reveal the model’s scalability and limitations.
Citation: Zhao C-S, Wong L-P (2025) A transformer-based structure-aware model for tackling the traveling salesman problem. PLoS ONE 20(4): e0319711. https://doi.org/10.1371/journal.pone.0319711
Editor: Yangming Zhou, Shanghai Jiao Tong University - Xuhui Campus, CHINA
Received: October 11, 2024; Accepted: February 5, 2025; Published: April 7, 2025
Copyright: © 2025 Zhao and Wong. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The combinatorial optimization problem is a specific kind of optimization problem that aims to find the optimal value within a set of discrete states. It exists widely in diverse fields, such as national security, transportation, industry, and daily life. The main approaches to solving COPs are traditional operational optimization methods, including exact methods, approximation methods, and heuristics. However, with the ongoing expansion of problem scales in practical applications and the increasing requirement for real-time answers, traditional approaches face significant computing burdens, which pose a challenge to realizing real-time solving for COPs [1–4].
Vinyals et al. [5] drew a parallel between solving COPs and the machine translation process, that is, sequence-to-sequence [6] mapping. They modified the classical encoder-decoder [7] model and proposed the Pointer Network (Ptr-Net), successfully applying deep learning to solve COPs, such as computing Delaunay triangulations, identifying planar convex hulls, and solving planar TSP. Trained with supervised learning, Ptr-Net takes the feature sequence (such as the sequence of city coordinates of the TSP) as input and directly outputs the solution (such as the visiting order of cities) through a simple forward propagation computation. Moreover, it has the ability to address any TSP that shares distributional characteristics similar to those observed in the training data. In contrast to traditional approaches, this end-to-end method does not necessitate iterative searches, exhibiting both fast solving speed and strong generalization ability. This breakthrough paved the way for a promising avenue of study within the domain of combinatorial optimization.
Since the proposal of the Ptr-Net, various innovative approaches aimed at improving the encoder, decoder, or training strategy have been introduced at top-tier conferences such as NeurIPS and ICLR in recent years. These methods have achieved outstanding optimization outcomes on COPs like TSP, the Vehicle Routing Problem (VRP), the Knapsack Problem, the Orienteering Problem (OP) and others. Bello et al. [8] adopted reinforcement learning [9] as a substitute for supervised learning to train Ptr-Net and applied it to solve the TSP and the Knapsack Problem. Their model outperformed that of Vinyals et al. As reinforcement learning tackles the difficulty of obtaining labeled samples for supervised learning in the context of COPs, it has become the dominant training approach for neural combinatorial optimization models. Since the order of data input in COPs, such as TSP and VRP, should not affect the solutions, Nazari et al. [10] simplified the Ptr-Net encoder by replacing the long short-term memory (LSTM) [11] layer with a 1D convolutional layer and significantly reduced the computational complexity of their model, saving approximately 60% training time on TSPs and outperforming classical heuristics on VRPs. Furthermore, Ma et al. [12] introduced graph neural networks (GNN) [13] to enhance the capabilities of the encoder and proposed the graph pointer network (GPN) to solve TSP and TSP with time windows. Their model exhibits notable generalizability in tackling larger-scale problem instances from synthetic datasets.
In recent years, the Transformer [14] model has achieved significant successes in the field of natural language processing (NLP) [15]. Based on the multi-head attention (MHA) [14] mechanism, the Transformer encoder exhibits a powerful capability to extract deep features from data. Given this, several recent studies have employed the Transformer to address COPs. Deudon et al. [16] used the encoder proposed in [14] to compute the feature vectors for cities in TSP. Their decoder linearly maps the cities visited in the last three steps into a query vector and utilizes the same attention mechanism as the Ptr-Net to predict the next city. Their network is trained with REINFORCE [17] in conjunction with stochastic gradient descent (SGD) on instances generated dynamically. Enhanced by a straightforward 2-opt procedure, their model achieves outcomes that are almost optimal on the TSP. Kaempfer & Wolf [18] trained their model, based on the Transformer architecture, to construct a fractional solution for the multiple TSP and then employed beam search to convert that fractional solution into a feasible integer solution. Building on the Transformer, Kool et al. [19] proposed a model that is capable of solving a wide range of COPs. They also adopted an encoder similar to the standard Transformer encoder to embed the nodes. However, they modified the decoding and training methodologies. At each decoding step, the decoder predicts the next node through an attention mechanism with a context that consists of the global representation of all cities, the first visited node, and the previously visited node. Training was carried out with reinforcement learning and enhanced with the rollout baseline [19] mechanism. This model surpasses all the previously mentioned end-to-end models [5,8,10,16] on problems such as TSP, Prize Collecting TSP (PCTSP), the Capacitated Vehicle Routing Problem (CVRP), and OP.
Its performance approaches that of professional solvers such as Gurobi [20], LKH3 [21], and Concorde [22]. Bresson et al. [23] employed the Transformer encoder architecture to address the TSP. They modified the decoder and trained their model with reinforcement learning. Their model achieves outcomes extremely close to optimal and elevates learned heuristics to a remarkably high level.
Despite the impressive achievements of Transformer-based models, Deudon et al., Kool et al., and Bresson et al. all utilized the default Transformer encoder architecture. These models take city coordinates as sequential input and employ the Transformer’s MHA mechanism to calculate the semantic similarity between nodes. However, they do not account for the graph structure information inherent in the nodes and their relationships. Problems such as TSP and VRP exhibit typical structural characteristics of graphs. Studies conducted by Dai et al. [24], Mittal et al. [25], Li et al. [26], and others [27–30] have indicated that incorporating the structural information of graphs is beneficial for solving these problems. Therefore, we believe that properly integrating the structural information of graphs with Transformer will bolster model performance. Ying et al. [31] investigated strategies to improve the efficacy of the Transformer model within the domain of graph representation learning. They introduced three graph structure encodings (edge encoding, degree centrality encoding, and spatial encoding) to effectively encode the structural information of a graph into Transformer and proposed the Graphormer [31]. The remarkable capacity of Graphormer results in achieving state-of-the-art performance across a diverse spectrum of tasks in practical applications. Inspired by the Graphormer model, this study enhances the standard Transformer model and proposes an improved encoder architecture to capture the positional relationship features of the nodes in TSP. Furthermore, in conjunction with a modified decoding mechanism, this study proposes an end-to-end model based on the Transformer for solving TSP. Following rigorous training with reinforcement learning, the proposed model exhibits notable proficiency in tackling TSPs of various scales.
The primary contribution of this study lies in the exploration of effective strategies to leverage graph structure information to enhance the performance of Transformer-based end-to-end models in addressing TSP. Ultimately, an enhanced Transformer-based model for solving TSP is proposed, utilizing the devised encoding and decoding mechanisms. The description of the encoding and decoding mechanisms is as follows:
- Two graph encoding mechanisms are designed for the Transformer to incorporate TSP graph structure information. Specifically, the Transformer encoder is enhanced for TSP, featuring closeness centrality [32] encoding for node importance and spatial encoding for spatial relationships of TSP nodes. Empirical studies confirm the effectiveness of these designs in improving the Transformer in TSP graph modeling.
- A decoding mechanism is proposed for TSP that not only places emphasis on both the starting node and the previously visited node, but also captures the dynamic expansion of the tour by incorporating all nodes visited so far. Empirical studies demonstrate its effectiveness in improving model training and model performance.
After being trained with reinforcement learning, the proposed model achieves deviation rates of 0.03%, 0.16%, and 1.13% for TSPs of 20-node, 50-node, and 100-node, respectively, compared to the results generated by the Concorde solver, a well-established benchmark for TSP solutions.
Preliminaries
The process of solving TSP based on the Transformer could be summarized as the mapping from an input sequence to an output sequence facilitated by a deep neural network. The model follows the classic encoder-decoder architecture and can be divided into four core modules: input, output, encoder, and decoder. Here we take the 2D Euclidean TSP as an example to introduce the basic principle. The overall architecture of the model and the solving process are illustrated in Fig 1.
TSP: Given a set of n cities V = {v1, v2, …, vn}, the TSP involves determining the minimum cost tour τ = (π1, π2, …, πn) that visits each of the n cities exactly once and returns to the starting city. Here, πk denotes the k-th city visited, so that the permutation (π1, …, πn) defines the visiting order of the tour. The cost of a tour is defined as the sum of the distances traveled between consecutive cities along the route.
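For concreteness, the tour cost defined above can be computed as follows. This is an illustrative Python sketch (the helper names are ours, not the paper's):

```python
import math

def tour_length(coords, tour):
    """Sum of Euclidean distances along a tour that returns to its start.

    coords: list of (x, y) city coordinates; tour: a permutation of city
    indices. Illustrative helper, not code from the paper.
    """
    total = 0.0
    for k in range(len(tour)):
        x1, y1 = coords[tour[k]]
        x2, y2 = coords[tour[(k + 1) % len(tour)]]  # wrap back to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Visiting the corners of the unit square in order gives a tour of length 4;
# the crossing order (0, 2, 1, 3) is strictly longer.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(square, [0, 1, 2, 3]))  # 4.0
```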
Input: Each city vi is characterized by its coordinates (xi,yi) in a 2D Euclidean space. The 2D coordinate sequence of the cities serves as the input for the model.
Output: Tour τ = (π1, π2, …, πn).
Encoder: The function of the encoder is to derive a contextualized representation for each city. The attention encoder employed is similar to that used in Vaswani et al.’s Transformer [14], but the positional encoding is omitted, ensuring that the node embeddings are unaffected by the order of the input data.
The encoder takes the 2D coordinates vi = (xi, yi) of n cities as input, then computes initial d-dimensional (e.g., d = 128 or 256) node embeddings by means of a learned linear projection characterized by parameters Wx and bx:

hi(0) = Wx vi + bx

Following batch normalization (BN) [33], the embeddings undergo a sequence of N attention layers. Each attention layer comprises two sublayers: an MHA layer, responsible for facilitating message passing among nodes, and a feed-forward (FF) layer composed of two position-wise linear transformations separated by a ReLU activation. Each sublayer incorporates a skip-connection [34] and a batch normalization. The procedure is formulated as:

ĥi = BN( hi(ℓ−1) + MHAi( h1(ℓ−1), …, hn(ℓ−1) ) )
hi(ℓ) = BN( ĥi + FF(ĥi) )

where ℓ indicates the attention layer index. The output of the previous attention layer serves as the input of the next encoding layer. Current studies such as [16,19] prefer to employ N = 3 attention layers and hidden layer dimensions of 512, aiming to strike a favorable balance between high-quality results and computational efficiency.
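The sublayer ordering (a skip-connection followed by BN, first around MHA and then around FF) can be sketched with scalar "embeddings"; `mha_stub` is a stand-in for real multi-head attention, and the learned BN scale/shift is omitted. This is a structural sketch only:

```python
def relu(x):
    return max(0.0, x)

def ff(h_i):
    # two position-wise linear maps separated by a ReLU (toy unit weights)
    return 1.0 * relu(1.0 * h_i)

def mha_stub(h):
    # placeholder for message passing: every node receives the mean embedding
    m = sum(h) / len(h)
    return [m] * len(h)

def bn(h):
    # batch normalization over the node dimension (no learned scale/shift)
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return [(x - mu) / (var + 1e-5) ** 0.5 for x in h]

def encoder_layer(h):
    h_hat = bn([hi + mi for hi, mi in zip(h, mha_stub(h))])  # skip + BN
    return bn([hi + ff(hi) for hi in h_hat])                 # skip + BN

out = encoder_layer([0.2, 0.5, 0.9])
```

Stacking N such layers reproduces the encoder's overall shape.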
The MHA sublayer performs self-attention computations across M distinct learned subspaces, thus earning its designation as “multi-head”. This process yields M representations (each with dimensionality d ∕ M) for each city. Within each head, the representation of each city is derived as a weighted aggregation of city embeddings, with the weights determined by an affinity function operating on queries Q and keys K associated with the cities. Subsequently, these representations are concatenated and projected back to a d-dimensional representation. The self-attention mechanism is formally defined as

Attention(Q, K, V) = softmax( QK^T ∕ √dk ) V

where Q = WQ H, K = WK H, V = WV H, and dk = d ∕ M. The matrices WQ, WK, and WV are all parameters that should be learned through training.
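The scaled dot-product attention above can be written out directly. The following stdlib-only sketch takes Q, K, and V as lists of row vectors (a real implementation would use batched tensor operations):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices given as lists of rows."""
    dk = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dk) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# A zero query scores all keys equally, so the values are simply averaged.
row = attention([[0.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```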
Decoder: The role of the decoder is to utilize an attention mechanism to progressively construct a solution in an autoregressive manner. Autoregressive means that at each step the previously generated symbol is used as an additional input to generate the next until a complete solution is constructed.
Given the presence of several effective decoding mechanisms [16,19,23], here we take the mechanism proposed by Deudon et al. [16] as an illustration, which is shown in Fig 1. At each decoding step t, the last three visited cities are aggregated into a query vector qt:

qt = W1 hπt−1 + W2 hπt−2 + W3 hπt−3

where W1, W2, and W3 are learnable parameters. The query qt interacts with the encoder output to calculate a probability distribution that identifies the city to visit next. This pointing mechanism is parameterized by three crucial factors: two attention matrices, Wref and Wq, alongside an attention vector denoted by v. The formulation is articulated as follows:

ui^t = v^T tanh( Wref hi + Wq qt )  if city i has not been visited;  ui^t = −∞ otherwise  (4)

A mask is employed to assign a value of −∞ to the log-probabilities associated with cities already visited within the tour, as elucidated by Eq (4). This operation is essential to ensure that the decoder generates admissible permutations of the input. Subsequently, the logits undergo a bounding operation within the interval [ −C, +C ] to regulate entropy, while a temperature hyperparameter T is introduced to govern the level of confidence in the sampling process, as shown in Eq (5):

p(πt = i) = softmax( C tanh( ui^t ∕ T ) )  (5)

According to the pointing probabilities, the next city is selected by sampling. Upon selection of the subsequent city, the query qt+1 is updated to include the selected city, and the autoregressive process completes once the tour is finished.
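A minimal sketch of this pointing step, using scalar city encodings and made-up scalar weights in place of the learned matrices W1, W2, W3, Wref, Wq, and vector v (a real model operates on d-dimensional embeddings):

```python
import math

def masked_softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    es = [0.0 if x == float("-inf") else math.exp(x - m) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def pointer_probs(q, refs, w_ref, w_q, v, visited, C=10.0, T=1.0):
    """Bello/Deudon-style pointing over toy scalar encodings."""
    logits = []
    for i, r in enumerate(refs):
        if i in visited:
            logits.append(float("-inf"))         # Eq (4): mask visited cities
        else:
            u = v * math.tanh(w_ref * r + w_q * q)
            logits.append(C * math.tanh(u / T))  # Eq (5): clip and temper
    return masked_softmax(logits)

# Query aggregated from the last three visited cities (weights are made up):
h = [0.1, 0.4, 0.7, 0.9]                     # toy scalar encoder outputs
q_t = 0.5 * h[2] + 0.3 * h[1] + 0.2 * h[0]
p = pointer_probs(q_t, h, w_ref=1.0, w_q=1.0, v=1.0, visited={0, 1, 2})
```

With three of four cities visited, all probability mass lands on the one admissible city.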
Proposed model
In this study, adhering to the classical encoder-decoder structure, a model tailored for 2D Euclidean TSP is proposed based on the Transformer. For a TSP instance denoted as s, which represents a fully connected graph comprising n nodes, the encoder maps the input set into a set of continuous representations h = (h1, h2, …, hn). Given h, the decoder then autoregressively generates a permutation τ = (π1, π2, …, πn) of the input s, which serves as the final solution.
Encoder
Our encoder inherits the overall structure of the encoder depicted in Fig 1. However, as emphasized in our introduction, there exists a potential in developing methodologies to exploit the structural information inherent in TSP graphs. Therefore, two effective encoding modules are introduced into our encoder to enhance its structure-aware ability. The mechanisms are illustrated in Fig 2.
Closeness centrality encoding.
In contemporary Transformer-based models, the computation of the attention distribution is predominantly based on semantic correlations among nodes, neglecting the insightful information offered by node centrality. Node centrality is a fundamental metric that indicates the importance of a node within a graph. The Graphormer incorporates the classical degree centrality [31] as an additional signal to enhance the Transformer. As this model focuses on fully connected TSPs, where all nodes have the same degree, degree centrality is no longer a significant feature. Consequently, closeness centrality is adopted as an alternative metric in this study to measure the importance of each node and is integrated into the Transformer. More specifically, a centrality encoding mechanism is developed that assigns a real-valued embedding vector z to each node based on its closeness centrality. This vector is then concatenated with the embedding of the node’s coordinates, constructing the initial input vector of the encoder:

hi(0) = [ ei || z cen(vi) ]

where ei represents the embedding of the coordinates (xi, yi), z cen(vi) denotes the learnable embedding vector associated with the closeness centrality cen(vi), and || represents the concatenation operation. Integrating centrality encoding into the input allows the attention mechanism to effectively capture signals indicating node importance in both keys and queries. This integration empowers the model to simultaneously capture semantic correlations and prioritize node importance within the attention mechanism.
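Closeness centrality itself is cheap to compute on a complete Euclidean graph. The sketch below uses the standard definition cen(vi) = (n − 1) ∕ Σj dist(vi, vj); the paper's exact normalization may differ:

```python
import math

def closeness(coords):
    """Standard closeness centrality on a complete Euclidean graph:
    cen(v_i) = (n - 1) / sum_j dist(v_i, v_j)."""
    n = len(coords)
    cens = []
    for i, (xi, yi) in enumerate(coords):
        total = sum(math.hypot(xi - xj, yi - yj)
                    for j, (xj, yj) in enumerate(coords) if j != i)
        cens.append((n - 1) / total)
    return cens

# The node at the center of the square is closest to the others on average,
# so it receives the highest centrality; the four corners are symmetric.
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
cen = closeness(cities)
```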
Spatial encoding.
The conventional Transformer [14] model incorporates positional encodings into input embeddings to leverage relative or absolute position information from sequential data tokens. However, this approach is not directly applicable to TSP. Unlike typical sequential data, where tokens are arranged in a linear fashion, TSP instances involve nodes situated in a 2D spatial space. These nodes are interconnected by edges, forming a graph structure that deviates from the linear sequence assumed by the conventional Transformer model. The input order of the TSP nodes does not accurately reflect their true positional relationships. As a result, when applying the Transformer to solve TSP, researchers commonly eliminate the positional encoding module to mitigate the influence of the input order of the nodes on final solutions. However, this approach neglects the spatial positional relationships between nodes, preventing models from leveraging this valuable information. In the TSP context, the spatial arrangement of the nodes directly affects the travel distances between the nodes, which is a critical factor in scouting the shortest feasible path. Nodes in close spatial proximity are more likely to be connected by edges within the solution. Recognizing this, spatial information becomes vital for various heuristic algorithms [35] and optimization techniques [36,37], aiding in the search for the shortest tour. To incorporate the structural information of the TSP nodes into our model, the Graphormer spatial encoding mechanism is employed. When applied to a given graph G, Graphormer defines a function ϕ(vi, vj) that quantifies the spatial relationship between the nodes vi and vj within the graph G. This function ϕ is typically defined based on the connectivity between the nodes in the graph. For this study, ϕ(vi, vj) is specifically defined as the Euclidean distance between nodes vi and vj.
Each output value of ϕ is then assigned a learnable scalar that functions as a bias term within the self-attention module. The ( i , j )-element of the attention query-key product matrix A at Transformer encoding layer ℓ is denoted as Aij and is formulated as

Aij = ( hi WQ )( hj WK )^T ∕ √dk + b ϕ(vi,vj)

where b ϕ(vi,vj) represents a learnable scalar indexed by ϕ(vi, vj). It is shared across all our Transformer encoding layers. The utilization of b ϕ(vi,vj) enables each node within each encoding layer to dynamically adjust its attention to other nodes based on the structural information of the graph. For example, if b ϕ(vi,vj) is learned to be inversely related to the value of ϕ(vi, vj), the model will tend to allocate greater attention to the nearby nodes while diminishing attention to those farther away.
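The effect of the additive bias can be seen in isolation. In the sketch below, all keys are identical, so the attention weights are driven purely by the per-pair bias; a bias that decreases with distance shifts attention toward nearby nodes (the bias values are made up for illustration):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def biased_attention_row(q, keys, bias_row):
    """One row of A: (q . k_j) / sqrt(d_k) + b_phi(i, j), then softmax.
    bias_row holds the learnable scalar selected for each pairwise distance."""
    dk = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dk) + bias
              for k, bias in zip(keys, bias_row)]
    return softmax(scores)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
w = biased_attention_row(q, keys, bias_row=[0.0, -1.0, -2.0])
```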
Following the encoding process delineated in Fig 1, our encoder receives the expanded embeddings hi(0), which encapsulate the centrality information of the nodes, and executes multiple layers of MHA encoding employing the attention mechanism enhanced by spatial encoding. This process ultimately yields node encodings with graph structure features.
Decoder
The decoding process unfolds sequentially. At each time step t ∈ { 1 , 2 , … , n } , the decoder produces an output node πt leveraging both the encoding results generated by the encoder and the partial tour π1:t−1. The decoding mechanism adopted in this study is similar to the approach proposed by Kool et al. [19]. Specifically, it introduces a designated context node, denoted c, to encapsulate the decoding context. This context node is subsequently employed as a query, while the encoder output serves as reference, thereby facilitating the prediction of the probability distribution of the next node through a single-head attention (SHA). To enhance model performance, an additional attention layer for the context node, named glimpse [5,8], is incorporated atop the encoder. This layer aggregates contributions from various nodes in the input sequence, contributing to overall model improvement.
Our strategy for constructing the context node c differs from that of Kool et al. They utilize the mean of the encoding of all nodes as a graph token and combine it with the starting node and the previously selected node to construct the context node. For TSP, since the newly selected node πt must be directly linked to the previously chosen node πt−1 to form a tour, each decoding step is highly dependent on the previously visited node πt−1. In addition, the path must return to the starting point (the node visited in the first step). Therefore, including the starting node and the previously selected node in the context node to ensure continuous attention to these critical nodes is an effective mechanism [38,39]. This study adopts this mechanism in the construction of the context node. However, the graph token utilized by Kool et al. encapsulates global features of the input sequence and remains static throughout the entire decoding process. This static nature of the graph token limits its ability to capture dynamic changes in the decoding context [39]. Specifically, it overlooks the influence of other nodes, such as visited nodes and candidate nodes, on the decoding decisions. Luo et al. [38] addressed a similar issue in their research by utilizing the starting node, the destination node, and the available nodes, thus achieving dynamic learning and enabling their model to adjust and refine the relationships between the starting/destination nodes and the available nodes. The available nodes employed in [38] consist of a set of unvisited nodes. In fact, the impact of visited nodes on the decoding process is also significant and has been emphasized in multiple studies, such as those of Bello et al. [8], Deudon et al. [16], and Bresson et al. [23]. Therefore, this study calculates the mean of the encodings of all visited nodes to create a token representing the latest partial tour, which is different from the mechanism employed in [38]. 
This token is then combined with the nodes visited in the first and previous steps to construct the context node for our decoder. At each time step, our context node not only emphasizes the significance of both the first and the most recently visited nodes, but also conveys information regarding all previously visited nodes to the decoder. The decoding process is illustrated in more detail in Fig 3.
This example illustrates how a tour π = ( 2 , 1 , 3 , 4 ) is generated based on the embedding results (h1,h2,h3,h4) obtained from the encoder. Best viewed in color.
Constructing the context node.
For t = 1, we calculate the mean of the encodings of all nodes and obtain the graph token h̄ = (1 ∕ n) Σi hi. The graph token is combined with two learnable d-dimensional placeholders Vf and Vl, yielding the context node hc = [ h̄, Vf, Vl ]. This context node is employed to launch the decoder to predict the starting node. For t > 1, the context node is built with the embedding of the first visited node hπ1, the embedding of the previously visited node hπt−1, and the mean of the encodings of all visited nodes h̄1:t−1 = (1 ∕ (t−1)) Σj=1..t−1 hπj:

hc = [ hπ1, hπt−1, h̄1:t−1 ]

Here, [ ⋅ , ⋅ , ⋅ ] denotes the horizontal concatenation operator.
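The two cases above can be sketched directly; the concatenation order and placeholder handling follow our reading of the construction and are illustrative rather than the paper's exact implementation:

```python
def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def context_node(h, tour, v_f, v_l):
    """Context node for the decoder (sketch).
    t = 1 (empty tour): [graph token, V_f, V_l];
    t > 1: [h_pi1, h_pi(t-1), mean of all visited node encodings]."""
    if not tour:
        return mean_vec(h) + v_f + v_l               # horizontal concatenation
    visited = [h[i] for i in tour]
    return h[tour[0]] + h[tour[-1]] + mean_vec(visited)

h = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
c1 = context_node(h, [], v_f=[0.0, 0.0], v_l=[0.0, 0.0])      # t = 1
c3 = context_node(h, [1, 0], v_f=[0.0, 0.0], v_l=[0.0, 0.0])  # t = 3
```

Unlike a static graph token, the third component is recomputed from the visited nodes at every step.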
Glimpsing with MHA.
At each decoding step t, a single glimpse using M-head attention is performed to compute a context node encoding hc′. This mechanism is inspired by the work of Bello et al. [8]. For each head, the attention keys ki and values vi are derived from the node encodings hi generated by our encoder, and the query qc is generated from the context node hc:

qc = WQ hc,  ki = WK hi,  vi = WV hi

where WQ, WK, and WV are all learnable parameters. At time t, the compatibility ui^t between the query qc and all nodes is calculated as follows:

ui^t = qc^T ki ∕ √dk  if node i has not been visited;  ui^t = −∞ otherwise

Here, ui^t denotes the compatibility score between qc and node i at time t, and dk = d ∕ M is the dimensionality of the keys and queries. The assignment of −∞ is utilized to mask nodes that have already been visited. Based on the compatibilities ui^t, the attention weights ai^t are derived through a softmax. For each head m ∈ [ 1 , M ], the vector h′m received by the context node is generated as a weighted combination of the values vi, denoted by:

h′m = Σi ai^t vi

Finally, a single d-dimensional context vector hc′ is obtained by projecting and aggregating these per-head vectors utilizing parameter matrices WOm:

hc′ = Σm=1..M WOm h′m
Calculating probabilities.
The probabilities are calculated by an independent decoder layer equipped with a single-head attention. Firstly, the compatibility ui^t of each node is computed according to the context vector hc′. Then, following Bello et al. [8], a tanh clipping is applied to the results (prior to masking) to constrain them within the range [ −C, C ]:

ui^t = C tanh( qc^T ki ∕ √d )  if node i has not been visited;  ui^t = −∞ otherwise

These compatibilities are interpreted as unnormalized log-probabilities. The ultimate selection probability of node i at time step t, denoted as pi^t, is computed using the softmax function:

pi^t = e^(ui^t) ∕ Σj e^(uj^t)
Finally, based on the probability distribution, the decoder employs a greedy or sampling strategy to select a subsequent node to visit.
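The single-head probability layer and the final node selection can be sketched together. The scalar clip value and toy vectors below are illustrative; a real model computes q and ki via learned projections:

```python
import math, random

def masked_softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    es = [0.0 if x == float("-inf") else math.exp(x - m) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def next_node_probs(q, keys, visited, C=10.0):
    """Single-head compatibilities q.k_i / sqrt(d), tanh-clipped to [-C, C],
    with visited nodes masked to -inf before the softmax."""
    d = len(q)
    logits = []
    for i, k in enumerate(keys):
        if i in visited:
            logits.append(float("-inf"))
        else:
            u = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            logits.append(C * math.tanh(u))
    return masked_softmax(logits)

def select(p, greedy=True, rng=random):
    """Greedy decoding picks the argmax; sampling draws from p."""
    if greedy:
        return max(range(len(p)), key=lambda i: p[i])
    return rng.choices(range(len(p)), weights=p)[0]

q = [1.0, 0.0]
keys = [[5.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p = next_node_probs(q, keys, visited={0})
nxt = select(p, greedy=True)
```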
Training the model
Following studies [8,16,19,23], a model-free policy-based reinforcement learning approach is employed to refine the parameters (denoted as θ) of our neural network. The deep neural network developed in this study defines a stochastic policy pθ that maps the input city sequence s to the output sequence τ. At step t, considering the visited cities π1:t−1 and the city sequence s, the probability of the subsequent city πt is pθ(πt | s, π1:t−1). Utilizing the chain rule, the probability of an entire tour τ can be deduced as:

pθ(τ | s) = Πt=1..n pθ(πt | s, π1:t−1)
Given an input set of cities s, the fundamental concept involves assigning elevated probabilities to preferred tours while minimizing probabilities for less desirable ones. As the quality of a tour is evaluated by its length L ( τ ) (which is endeavored to minimize), for a given problem instance s, the training objective is defined as the expected tour length:

J(θ | s) = E τ∼pθ(τ|s) [ L(τ) ]

To endow the policy with generalization capabilities, the training instances should be sampled from a distribution S. Consequently, the overarching training objective is formulated as:

J(θ) = E s∼S [ E τ∼pθ(τ|s) [ L(τ) ] ]  (17)
The policy gradient method along with SGD is adopted to iteratively update the model parameters. The gradient of Eq (17) could be formulated using the established REINFORCE algorithm [17] as:

∇θ J(θ) = E [ ( L(τ | s) − b(s) ) ∇θ log pθ(τ | s) ]  (18)

where b(s) represents a baseline function utilized to estimate the anticipated tour length of the input sequence s, and L ( τ ∣ s ) − b ( s ) determines the direction of the gradient updates.
By drawing B i.i.d. sample instances s1, s2, …, sB ∼ S and sampling one solution τi for each, the gradient of Eq (18) could be approximated by Monte Carlo sampling as follows:

∇θ J(θ) ≈ (1 ∕ B) Σi=1..B ( L(τi | si) − b(si) ) ∇θ log pθ(τi | si)
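In practice this estimate is implemented as a surrogate loss whose gradient matches the Monte Carlo expression, with the advantage L − b treated as a constant with respect to θ. A minimal numeric sketch (the sample lengths and log-probabilities below are made up):

```python
import math

def reinforce_loss(lengths, baselines, log_probs):
    """(1/B) * sum_i (L(tau_i) - b(s_i)) * log p_theta(tau_i | s_i).
    Differentiating this w.r.t. theta (through log_probs only) yields the
    Monte Carlo REINFORCE gradient estimate."""
    B = len(lengths)
    return sum((L - b) * lp
               for L, b, lp in zip(lengths, baselines, log_probs)) / B

# A sampled tour shorter than the baseline has a negative advantage, so
# minimizing this loss raises its log-probability: the policy is reinforced.
loss = reinforce_loss([3.8, 4.2], [4.0, 4.0],
                      [math.log(0.3), math.log(0.2)])
```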
A robust baseline b(s) is essential to mitigate the variance in the gradient and accelerate the training process. In this study, the greedy rollout baseline method introduced by Kool et al. [19] is adopted, instead of employing an additional critic network to approximate b(s). For each training epoch, following a methodology akin to the approach employed in the deep Q-network (DQN) [40], where the target Q-network is frozen after each training epoch, the best policy achieved up to the current epoch is preserved as the baseline policy, denoted pθBL. The baseline b(s) estimates the tour length L(τBL) of the solution τBL, which is derived from the deterministic greedy rollout of the baseline policy. If the current training policy pθ surpasses the baseline, the tour length L ( τ ) of the sampled solution τ based on pθ will be shorter than b(s), L ( τ ) − b ( s ) will be negative, and the current policy will be reinforced, and vice versa. After each training epoch, the performance of both the new policy and the baseline is compared on a validation dataset of 10,000 instances. If the new policy outperforms the baseline and the enhancement is statistically significant according to a paired t-test [41] with a predetermined significance level of α = 5%, the baseline parameters θBL are updated with the newly learned parameters θ. As a result, this approach ensures that our model is consistently challenged by the best model achieved thus far, aligning with the self-critical training principles proposed by Rennie et al. [42]. To prevent overfitting, the validation dataset is refreshed whenever the baseline is updated. As the baseline model periodically updates and saves the best model, the latest baseline model is employed as the final output model. The algorithmic procedure can be found in Algorithm 1 as detailed in reference [19].
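The significance check can be sketched as follows. For simplicity this uses a normal approximation to the t statistic's null distribution, which is adequate at the 10,000-instance validation size used here; an exact t-distribution p-value would come from e.g. scipy.stats.ttest_rel. The function names are ours:

```python
import math

def paired_t_p_value(new, base):
    """One-sided paired test that the new policy's tours are shorter."""
    d = [a - b for a, b in zip(new, base)]     # negative => new is shorter
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    if var == 0.0:
        return 0.0 if mean < 0 else 1.0
    t = mean / math.sqrt(var / n)
    # one-sided p-value via the standard normal CDF, Phi(t)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def update_baseline(new, base, alpha=0.05):
    """Replace the baseline only if the improvement is significant."""
    return sum(new) < sum(base) and paired_t_p_value(new, base) < alpha

new = [3.90, 3.85, 3.95, 3.88, 3.92]    # toy validation tour lengths
base = [4.00, 4.00, 4.00, 4.00, 4.00]
```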
Experiments
This study focuses on the TSP and systematically evaluates the efficacy of the proposed model across three benchmark tasks: Euclidean TSP20, TSP50, and TSP100. The experiments are implemented using Ubuntu 20.04, Python 3.8, and PyTorch 1.11.0. Consistent hyperparameters are employed for all problems.
Hyperparameters
For all three problems, all datasets are uniformly sampled from the 2D unit square [0,1]². The training dataset contains 1,280,000 instances, while the validation and test datasets each consist of 10,000 instances. The mini-batch size is 512; thus, the parameters are updated 2,500 times per epoch. Early stopping is applied with a patience threshold of 20 epochs. The two coordinates and the closeness centrality of each city are individually embedded into 128-dimensional spaces and then concatenated into a d = 256 combined embedding. The attentive encoder consists of 3 stacks, each equipped with M = 8 parallel heads, and the hidden dimensionality is d = 256. Each head therefore operates in a d/M = 256/8 = 32-dimensional subspace. Both the input and output dimensions of the feed-forward sublayer are 256, with an inner layer of dimensionality 512 and a ReLU activation. The pointing mechanism operates within a 256-dimensional space. Model parameters θ are initialized from a uniform random distribution over the interval (−1/√d, 1/√d). The tanh logits of the pointing mechanism are clipped to the interval [−10, 10]. Adam [43] is utilized as the optimizer for stochastic gradient descent. The learning rate is held fixed for the first 100 epochs to speed up initial learning and is then decayed by a factor of 0.98 every 2 epochs. In the first epoch, an exponential baseline (β = 0.8) is adopted in place of the rollout baseline to warm up the learning. The significance threshold for the paired t-test used in baseline selection is set to α = 5%.
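As a concrete illustration of the closeness centrality feature embedded above, the standard definition on the complete Euclidean graph can be computed directly from the city coordinates. This sketch uses the textbook form; the paper's exact normalization may differ.

```python
import math

def closeness_centrality(coords):
    """Closeness centrality of each city on the complete Euclidean graph:
    c_i = (n - 1) / sum_j d(i, j), so more central cities score higher."""
    n = len(coords)
    centrality = []
    for i, (xi, yi) in enumerate(coords):
        total = sum(math.hypot(xi - xj, yi - yj)
                    for j, (xj, yj) in enumerate(coords) if j != i)
        centrality.append((n - 1) / total)
    return centrality
```

Each resulting scalar is then embedded into a 128-dimensional space alongside the coordinate embedding before encoding.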
Training
The proposed model is trained individually to solve TSP20, TSP50, and TSP100. Each training session is conducted exclusively on a single NVIDIA GeForce RTX 3090 (24G) graphics card, with 200 training epochs. The average duration of each epoch is 217, 525, and 1,330 seconds for TSP20, TSP50, and TSP100, respectively. Following each epoch of training, the learned policy is evaluated using a validation dataset, employing the greedy mechanism. The loss is quantified as the mean tour length of the validation dataset. The evolution of our model’s learning progress over time on problems of different sizes is illustrated in Fig 4.
As shown in the figure, the model exhibits relatively consistent convergence trends and processes across problems of different scales. Initially, each model starts with a higher loss. As training progresses, the losses decrease steadily, converging towards stable minima. This trend indicates that the models are effectively exploring and refining their policies. The curves show a sharp decline before they begin to plateau, suggesting that our models rapidly comprehend and adapt to the complexities of different TSP instances. Additionally, while minor irregularities are observed, the consistent progression towards lower losses confirms the effectiveness of the training methodology.
Results and discussion
The performance of the proposed model in tackling TSP20, TSP50, and TSP100 is assessed through the mean tour length over test datasets of 10,000 distinct instances for each problem size. The test datasets are constructed in accordance with the methodology adopted by Kool et al. [19] and serve as the benchmark for our comparative study. During inference, two distinct decoding strategies are utilized. The first, greedy decoding, sequentially selects the city with the highest probability at each decoding step, generating a single solution. The second, sampling, generates 1,024 candidate solutions for each problem instance, from which the best solution is selected. The performance of the proposed model under greedy decoding is systematically contrasted with baselines that generate a single solution, while its performance under the sampling strategy is rigorously evaluated against baselines that employ sampling or search methodologies to explore multiple solutions. For each problem instance, the result obtained by Concorde is reported as the benchmark for calculating the optimality gaps, which are expressed as percentages. Furthermore, the compared approaches are categorized into two groups: non-Transformer-based and Transformer-based.
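The two decoding strategies can be sketched as follows, with `policy_probs` standing in as a hypothetical placeholder for the trained model's per-step city distribution (it is not the paper's network):

```python
import random

def greedy_decode(policy_probs, n):
    """Greedy decoding: always take the highest-probability unvisited city."""
    tour, visited = [0], {0}
    while len(tour) < n:
        probs = policy_probs(tour, visited)
        nxt = max((c for c in range(n) if c not in visited), key=lambda c: probs[c])
        tour.append(nxt)
        visited.add(nxt)
    return tour

def sample_decode(policy_probs, n, num_samples, length_fn, rng):
    """Sampling decoding: draw num_samples stochastic tours, keep the shortest."""
    best, best_len = None, float("inf")
    for _ in range(num_samples):
        tour, visited = [0], {0}
        while len(tour) < n:
            probs = policy_probs(tour, visited)
            cities = [c for c in range(n) if c not in visited]
            nxt = rng.choices(cities, weights=[probs[c] for c in cities])[0]
            tour.append(nxt)
            visited.add(nxt)
        length = length_fn(tour)
        if length < best_len:
            best, best_len = tour, length
    return best, best_len
```

Greedy decoding costs one forward pass per instance, whereas sampling trades `num_samples` times the compute for better solution quality, which mirrors the time/quality trade-off reported in the tables.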
Comparison with non-Transformer-based methods
Non-Transformer-based comparative methods include: (1) classic heuristics, namely the nearest neighbor, nearest insertion, random insertion, farthest insertion, and Clarke Wright savings [44] algorithms; these are non-learned baselines that construct a solution sequentially, resembling the structure of the proposed model; (2) OR Tools, a powerful solver developed by Google for routing problems; and (3) deep learning methods grounded in construction and improvement heuristics, involving both reinforcement [8,24] and supervised [5,28] learning. We implement the classic heuristics in Python. For OR Tools, we develop a Python script specifically for solving TSP, adhering to the TSP-solving guidelines provided by OR Tools and adopting its constraint solver. We use the default search parameters, with the initial solution strategy set to PATH_CHEAPEST_ARC. All implementations are evaluated on our test datasets, and their optimality gaps are computed against the results of Concorde. For the deep learning methods proposed by Bello et al. [8] and Joshi et al. [28], we refer directly to the comprehensive experimental results provided in their original studies, while the experimental results for the models introduced by Vinyals et al. [5] and Dai et al. [24] are obtained from the study by Kool et al. [19].
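For reference, the nearest neighbor heuristic used as one of the non-learned baselines can be implemented in a few lines. This is an illustrative sketch of the standard algorithm, not our exact benchmark code:

```python
import math

def nearest_neighbor_tour(coords, start=0):
    """Nearest neighbor construction: from the current city, repeatedly
    move to the closest unvisited city; the tour closes back to the start."""
    n = len(coords)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        cur = tour[-1]
        nxt = min(unvisited, key=lambda j: math.dist(coords[cur], coords[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

Like the proposed model's greedy decoding, this heuristic builds the tour one city at a time, which is why it serves as a structurally comparable baseline.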
Table 1 clearly illustrates the efficacy of the proposed model employing the greedy decoding strategy to solve TSPs of various sizes. The model achieves an optimality gap of only 0.10% on TSP20, closely approximating the performance of the Concorde solver. This gap increases modestly to 0.81% for TSP50 and to 2.72% for TSP100, demonstrating strong performance even as problem complexity grows. Compared with classic heuristic methods, such as nearest neighbor, nearest insertion, random insertion, and farthest insertion, the proposed model demonstrates significant advancements in solution quality. Among these methods, farthest insertion achieves the best results on TSP20 and TSP50, with optimality gaps of 2.53% and 7.29%, respectively, while random insertion performs best on TSP100, achieving a gap of 9.62%. The proposed model surpasses these results with reductions of 2.43%, 6.48%, and 6.90% on TSP20, TSP50, and TSP100, respectively, underscoring its ability to outperform the most commonly applied traditional heuristics. The comparison with the Clarke Wright savings algorithm, a more advanced and effective heuristic, further highlights the model's effectiveness. The Clarke Wright savings algorithm achieves optimality gaps of 3.78%, 6.65%, and 8.27% for TSP20, TSP50, and TSP100, respectively. The proposed model improves upon these results by 3.68%, 5.84%, and 5.55%, demonstrating its capability to outperform both general and advanced heuristic approaches. The OR Tools solver also delivers competitive results, with optimality gaps of 0.73%, 2.67%, and 3.75% for TSP20, TSP50, and TSP100. Despite its strong performance, the proposed model achieves lower gaps, reducing the optimality gaps by 0.63% on TSP20, 1.86% on TSP50, and 1.03% on TSP100.
Finally, compared to non-Transformer-based deep learning methods that greedily construct solutions, the proposed model consistently outperforms state-of-the-art alternatives, including those proposed by Vinyals et al., Bello et al., Dai et al., and Joshi et al. Among these comparative methods, Joshi et al. achieve the smallest optimality gaps of 0.60% and 3.10% on TSP20 and TSP50, while Bello et al. perform best on TSP100 with an optimality gap of 6.90%. However, the proposed model outperforms these approaches in all three problem sizes, reducing the optimality gaps by 0.50%, 2.29%, and 4.18%, respectively.
From the perspective of computational time, the proposed model demonstrates competitive efficiency when executed in greedy mode. Specifically, it requires 0.016, 0.037, and 0.076 seconds to solve TSP20, TSP50, and TSP100 instances, respectively. Compared to classic heuristic methods, such as nearest neighbor, nearest insertion, random insertion, and farthest insertion, the proposed model generally consumes slightly more time. However, this marginal increase is offset by substantial improvements in solution quality. In particular, as the problem size increases, the efficiency gap between the proposed model and these traditional heuristic methods narrows. For example, the model solves a single TSP100 problem in 0.076 seconds, which is faster than the 0.082 seconds required by random insertion and the 0.103 seconds by farthest insertion. When compared with Clarke Wright savings and OR Tools, the proposed model demonstrates significantly superior efficiency on TSP50 and TSP100, while consistently achieving improved solution quality across all problem scales. Unfortunately, for non-Transformer-based deep learning methods, a direct and fair comparison is not possible because some studies (Vinyals et al. and Dai et al.) do not report their time data, while others (Bello et al. and Joshi et al.) do record time data, but these are based on differing software and hardware platforms. Nevertheless, despite these limitations, the available results reveal notable trends. Bello et al.’s model demonstrates the best computational efficiency. However, its solution quality remains significantly inferior to that of the proposed model. Compared to Joshi et al., even when accounting for platform differences, the computational efficiency of the proposed model is on a completely different level, solving instances several orders of magnitude faster.
Table 2 presents a comparative analysis of different approaches that utilize a sampling strategy to solve TSPs across three different problem sizes. Sampling is a widely employed technique for improving solution quality. By drawing 1,024 samples, the proposed model further reduces the optimality gap to 0.03% on TSP20, 0.16% on TSP50, and 1.13% on TSP100. These results demonstrate the outstanding capability of the proposed model in achieving near-optimal solutions across problem scales. Although sampling improves the performance of OR Tools over its greedy mode, the proposed model still markedly outperforms it, achieving optimality gap reductions of 0.34%, 1.67%, and 1.77% for TSP20, TSP50, and TSP100, respectively. When compared to Joshi et al.'s model augmented with beam search, our model demonstrates consistently and significantly enhanced solution quality across TSP20, TSP50, and TSP100. For instance, on TSP100, where Joshi et al.'s model achieves an optimality gap of 2.11%, the proposed model achieves a reduced gap of 1.13%, an improvement of 0.98%. Similarly, when evaluated against the approaches of Bello et al. and Costa et al., the proposed model achieves consistently lower optimality gaps on TSP50 and TSP100, while exhibiting only marginal differences on TSP20. It is important to note that while Bello et al. report a lower objective value of 5.7 on TSP50, this does not translate to superior performance due to differences in the datasets used. Such variability prevents a direct comparison of objective values. However, based on the optimality gaps, which provide a reliable measure of performance, the proposed model outperforms Bello et al. on TSP50.
In addition to achieving outstanding solution quality, the proposed model demonstrates competitive efficiency when executed in sampling mode. Specifically, it solves TSP20, TSP50, and TSP100 instances by sampling 1,024 candidates in 17.732 seconds, 41.050 seconds, and 82.561 seconds, respectively. In contrast, OR Tools with 1,024 samples requires significantly longer execution times of 20.604 seconds, 163.074 seconds, and 803.508 seconds for the same problem sizes, exhibiting lower computational efficiency. The execution times reported for Bello et al., Joshi et al., and Costa et al. [45] are derived from their respective studies and are obtained using different hardware platforms. Although this prevents a direct and fair comparison, the substantial time differences remain noteworthy. Even accounting for platform variability, the observed discrepancies suggest that the proposed model achieves a significant advance in computational efficiency.
The experimental results presented in Table 1 and Table 2 clearly illustrate the superior and consistent performance of the proposed model using both greedy and sampling decoding strategies to address the challenges posed by the TSP of varying sizes, establishing it as a robust and effective approach for tackling TSPs.
Comparison with transformer-based methods
The primary focus of this study is to enhance the efficacy of solving TSP using the Transformer model. As a result, studies employing the Transformer architecture, such as those proposed by Deudon et al. [16], Kool et al. [19], and Bresson et al. [23], serve as particularly pertinent benchmarks owing to their structural and training paradigm similarities. To ensure a fair comparison and minimize performance disparities due to differences in hardware, software platforms, data quantity, and training duration, all models are retrained under identical conditions and evaluated on the same test datasets as our proposed model. Each model undergoes training for 200 epochs. To conduct a comprehensive performance comparison, we test the final models under both greedy decoding and sampling decoding on our benchmark test datasets. Additionally, to facilitate an equitable assessment of runtime, irrespective of hardware acceleration factors such as CPU, GPU, and parallel computation, each final model and Concorde are executed on an Intel Xeon Platinum 8255C (2.50 GHz) CPU in a single-threaded context. The runtime for both Concorde and greedy decoding is reported as the average across 10,000 test instances, while for sampling, it is calculated as the average over 50 instances. The test outcomes of the final models are summarized in Table 3.
From Table 3, it can be observed that our model surpasses both Deudon et al.'s [16] and Kool et al.'s [19] models across all three TSP problems in terms of solution quality. This superiority becomes more pronounced as the problem size increases. Employing greedy decoding, our model achieves optimality gaps of 0.10%, 0.81%, and 2.72% for TSP20, TSP50, and TSP100, respectively. Relative to Deudon et al.'s model, our reductions in the optimality gap are 0.32%, 1.54%, and 4.13% for TSP20, TSP50, and TSP100. Similarly, compared to Kool et al.'s model, the reductions are 0.11%, 0.49%, and 1.36%. In comparison with the state-of-the-art model proposed by Bresson et al., the two models perform comparably, with only marginal differences. While Bresson et al.'s model shows superior performance on TSP100, our model slightly outperforms it on TSP20 and TSP50 in terms of the optimality gap. The application of sampling decoding further improves the solution quality of each model to varying degrees. Under the sampling mode, our model markedly reduces the optimality gap to 0.03%, 0.16%, and 1.13% for TSP20, TSP50, and TSP100, maintaining a consistent advantage over Deudon et al.'s and Kool et al.'s models across all problems, while also remaining comparable with Bresson et al.'s model. Therefore, these results substantiate the effectiveness and competitiveness of our model in proficiently solving TSP.
We conduct a comprehensive analysis of the runtime efficiency of our approach. As illustrated in Table 3, when operating under the greedy decoding mode, our model exhibits remarkable efficiency, requiring only 0.016, 0.037, and 0.076 seconds on average to solve a single instance of TSP20, TSP50, and TSP100, respectively. This performance surpasses that of Concorde and Bresson et al.’s model, and closely aligns with Kool et al.’s model. Our model preserves the rapid problem-solving capability inherent in end-to-end models. Upon transitioning to the sampling decoding mode, our model accumulates a total time cost of 17.732, 41.050, and 82.561 seconds for sampling 1,024 solutions for each problem instance, respectively. This time expenditure is significantly lower than that of Bresson et al.’s model. While the running time of our model is slightly higher compared to Deudon et al.’s and Kool et al.’s models, the solution quality is noticeably improved. The additional time consumption can be attributed to the incorporation of spatial encoding and centrality encoding in the encoder, introducing a degree of complexity and computational overhead. It is noteworthy that while Deudon et al.’s model exhibits the highest execution efficiency, its optimization performance falls significantly short compared to our proposed model. In contrast, Bresson et al.’s model achieves state-of-the-art optimality gaps on TSP100. However, this advancement is achieved through an increased number of attention calculation layers in the decoder phase, leading to a processing time of 170.149 seconds, which is more than double the 82.561 seconds consumed by our model. Therefore, our model demonstrates an outstanding balance between time efficiency and solution quality.
This study additionally compares the stability of Transformer-based models concerning solution quality. Employing greedy decoding, the four models are tested with identical test datasets (size = 10,000) on TSP20, TSP50, and TSP100. The results generated by Concorde are adopted as the benchmark solutions, against which the optimality gap (in percentage) is calculated for each solution. Fig 5 offers a visual representation of the distribution of optimality gaps and Table 4 provides the specific statistics, including the first quartile (Q1), the second quartile (Q2) (also known as the median), and the third quartile (Q3), along with the interquartile range (IQR). The IQR represents the span between Q3 and Q1, delineating the range within which the central 50% of the data resides.
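The quartile statistics reported in Table 4 follow the standard definitions and can be reproduced with the Python standard library. The sketch below uses the "inclusive" interpolation method, which matches the common quartile convention; the exact interpolation rule used in the paper's analysis is not specified.

```python
from statistics import quantiles

def gap_quartiles(gaps):
    """Return Q1, Q2 (median), Q3, and IQR of per-instance optimality gaps (%)."""
    q1, q2, q3 = quantiles(gaps, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1
```

Applied to the 10,000 per-instance gaps of each model, these four numbers summarize both the typical solution quality (Q2) and its spread (IQR).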
Fig 5 clearly shows that the proposed model consistently exhibits a superior and tight distribution of optimality gaps across all three scales of TSP, surpassing the performance of Deudon et al.'s and Kool et al.'s models by a significant margin. The superiority becomes more evident with increasing problem size. Referring to Table 4, the Q1, Q2, Q3, and IQR values of the proposed model on TSP20 are equal to or very close to 0%. This indicates that our model achieves a very high and stable solution quality on small-scale problems, with at least half of the solutions matching the quality of those generated by the Concorde solver. For TSP50, the quartile values of the proposed model are Q1 = 0.16%, Q2 = 0.58%, and Q3 = 1.24%. These values imply that when using greedy mode to solve a problem just once, our model has a 25% chance of approximating the optimal solution within a 0.16% gap, a 50% probability of keeping the solution within a 0.58% gap of the optimum, and a 75% probability of maintaining the optimality gap within 1.24%. This demonstrates the high reliability of our model's solutions. In the case of TSP100, the Q1, Q2, and Q3 values are 1.74%, 2.60%, and 3.56%, with a slight widening of the optimality gap as the problem size increases. However, compared to Deudon et al.'s and Kool et al.'s models, our model exhibits markedly lower quartile values. Moreover, the IQRs for TSP50 and TSP100 are 1.08% and 1.81%, respectively, indicating that the fluctuation in the quality of the majority of solutions is small. Additionally, from Fig 5 and Table 4, it can be observed that the distribution of our model's optimality gaps is slightly better than that of Bresson et al.'s model on TSP20 and TSP50, and comparable to it on TSP100. These statistical findings illustrate that the proposed model maintains remarkable stability in delivering reliable solutions across problems of various sizes.
Ablation study
To illustrate the influence of the proposed encoding and decoding mechanisms on model performance, we conduct a comprehensive ablation study on the TSP50 dataset. Multiple model variants, each incorporating different subsets of the proposed enhancements, are trained under identical hyperparameter configurations and computational conditions. Their performance is evaluated on a fixed validation set of 10,000 instances, using percentage-wise optimality gaps relative to Concorde. The results, depicted in Fig 6, demonstrate that both the encoding and decoding mechanisms independently improve performance, and their combined integration yields the greatest overall improvement.
Starting with the baseline model (blue curve), we observe a moderate convergence rate and a certain steady-state optimality gap. By introducing the proposed encoding mechanism—incorporating spatial encodings to capture relational information among nodes and centrality encodings to reflect underlying graph-theoretic importance—we obtain a model (green curve) that learns richer node representations. Although this enhancement initially slows early-stage training due to increased model complexity, it ultimately facilitates better discernment of global and local structural patterns, enabling the model to identify higher-quality tours. This improvement underscores the value of our encoding for transforming raw TSP data into more semantically rich and geometrically meaningful representations.
Similarly, augmenting the baseline with our decoding mechanism (purple curve) leads to substantial performance improvements. Grounded in a reinforcement learning framework, the decoder models the TSP as a Markov Decision Process, where decisions about the next node are guided by a state representation that reflects the current progress of the tour. By incorporating the entire set of visited nodes as a crucial component, the decoder gains a more accurate and holistic view of the decision context, enabling it to make better-informed decisions. This approach aligns with the theoretical foundations of the Held-Karp [46] algorithm, which ensures global optimality by considering subsets of visited nodes. By leveraging this enriched decoding context, instead of relying solely on limited local information, such as the last visited node or a narrow window of recent nodes, the decoder achieves faster convergence and significantly reduces the final optimality gap compared to the baseline. These results underscore the ability of the decoder to more effectively approximate globally optimal solutions and highlight the importance of integrating comprehensive state representations into the decision-making process.
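The enriched decision state described above can be illustrated schematically: concatenate the start node's embedding, the most recently visited node's embedding, and an aggregate over all visited nodes. The composition below (plain lists, mean pooling) is an assumed simplification of the mechanism for clarity, not the model's exact operator.

```python
def decoder_context(embeddings, visited):
    """Build a 3d-dimensional decision state from a partial tour:
    [start embedding | last-visited embedding | mean of all visited].
    `embeddings` is a list of d-dimensional node embeddings and
    `visited` holds the ordered indices of the partial tour."""
    start = embeddings[visited[0]]
    last = embeddings[visited[-1]]
    d = len(start)
    mean_visited = [sum(embeddings[i][k] for i in visited) / len(visited)
                    for k in range(d)]
    return list(start) + list(last) + mean_visited
```

Because the mean over visited nodes changes at every step, the context evolves with the tour, giving the decoder a dynamic summary rather than a fixed window of recent nodes.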
The most pronounced improvement emerges when both the enhanced encoder and the globally informed decoder are integrated (red curve). In this configuration, the model benefits simultaneously from comprehensive structural representations and globally coherent decision-making, accelerating convergence and achieving the smallest overall optimality gap. This synergy suggests that the superior performance of the full model is not attributable to a single innovation, but rather to the complementary effects of both the encoding and decoding strategies.
Scalability and limitations
Scalability is a pivotal aspect of any algorithm in combinatorial optimization, yet scaling end-to-end approaches to more complex, realistic instances remains an open problem [1,28,38,39,47,48]. Training learning-based models on large-scale graphs is computationally intensive, and models trained solely on small-scale datasets often struggle to generalize to larger instances due to distributional shifts [39,47,49,50]. For example, Joshi et al. [47] reported that even after training directly on 12.8 million TSP200 samples for more than 500 hours, their learning-based model did not outperform a simple insertion heuristic, underscoring the difficulty of scaling to larger problem sizes.
The Transformer architecture, with its quadratic complexity, further intensifies these computational challenges by substantially increasing both the memory usage and the demand for extensive data. In our own attempts, training a TSP200 model with a batch size of 512 under our experimental setting required approximately 53GB of GPU memory, which significantly exceeds the capacity of typical academic-scale hardware. Even using an NVIDIA A800 GPU, the training process required approximately 40 minutes per epoch, and progress became minimal after 28 epochs, prompting early stopping. Notably, other closely related studies [16,19,23] also restrict their models to TSP100. These findings underscore the prohibitive nature of fully training models on larger instances with typical hardware, reflecting a widely recognized scalability barrier encountered by current Transformer-based approaches.
Despite these challenges, evaluating the scalability of the proposed model under more demanding and practical conditions remains crucial for understanding its strengths and limitations. To this end, we tested our medium-scale TSP50 model on a diverse set of TSPLIB instances. These instances vary considerably in size, ranging from approximately 1× to 4× the default model size, allowing us to gain insights into the model's performance across a broad range of scenarios. For each instance, we adopted the sampling mechanism (1,024 samples) to generate the solution, and the results are presented in Table 5.
The experimental results demonstrate that the proposed model performs well on small-scale problems comparable in size to the pre-trained model, achieving the lowest gaps among the comparative algorithms on eil51 (0.23%) and pr76 (2.54%). It also performs competitively on eil76, with a gap of 2.60%, slightly trailing Bresson et al.'s 2.23%. Despite being trained solely on randomly generated data, our model performs robustly on these instances, suggesting its potential for small-scale applications, such as logistics optimization for regional delivery routes and vehicle routing for local transportation networks. However, when the problem size doubles, as in the cases of eil101 and pr124, the model's performance shows a noticeable decline. When the problem size scales to around four times the default model size, the decline becomes significant, with the gap expanding to 17.20% on pr226, revealing significant scalability limitations. Notably, these limitations are consistent across the comparative models, such as those of Kool et al. and Bresson et al. on pr226. Moreover, none of the models exhibits consistent dominance on these test instances. These findings reaffirm that scalability remains a pervasive challenge in the field of neural combinatorial optimization.
The commendable performance of the proposed model on small-scale problems can be attributed to the effective integration of spatial and centrality encodings, which introduce beneficial biases in how the Transformer attends to the input. For smaller problems, these encodings emphasize critical features, such as the centrality of a city or local spatial relationships, guiding the model’s attention to important parts of the problem. However, larger-scale TSPs often feature more uniform node distributions or complex, long-range interactions that are not as effectively captured by the same encodings. These biases, which are advantageous for small instances, may mislead the model at scale by overemphasizing features that become less informative or even counterproductive in larger graphs. Adaptive encoding schemes that dynamically adjust to problem size and data distribution, along with transfer learning and data augmentation, are promising approaches to address these scalability challenges.
Conclusion
End-to-end models built upon the Transformer have become a research hotspot in neural combinatorial optimization due to their outstanding efficacy. This study explores how to integrate graph structure information into the Transformer framework to further enhance the performance of neural combinatorial optimization and proposes a structure-aware model specifically for solving TSP. Through the additional closeness centrality encoding and spatial encoding, the proposed model augments the Transformer's capacity to perceive the graph structure inherent in TSP instances, especially for smaller-scale problems. Furthermore, this study introduces a modified decoding mechanism that not only emphasizes the starting node and the most recently visited node, but also captures the dynamic evolution of the tour by covering all the nodes visited so far. This design provides the decoder with more informed decision states for making better predictions and ultimately improves the overall performance of the model. Trained through deep reinforcement learning, the proposed model achieves results closely approaching optimality for 2D Euclidean TSP instances with up to 100 nodes, surpassing various comparable models in terms of solution quality. Coupled with its rapid solving speed and robust stability, it shows an outstanding balance between efficiency and effectiveness, positioning it as a competitive choice for tackling TSP. This study demonstrates the positive influence of graph structure information in addressing TSP. Additionally, it serves as a showcase for developing effective improvement mechanisms to leverage this valuable information, thereby enhancing the performance of Transformer-based neural combinatorial optimization models.
Scalability is a common issue in the field of neural combinatorial optimization, particularly for large problem instances and practical applications. As discussed previously, the encoding and decoding mechanisms play a critical role in determining the performance of the model. Consequently, designing more effective encoding and decoding mechanisms to enhance the model’s scalability across various problem scales warrants further exploration. Additionally, model training requires substantial computational resources, which limits the model’s ability to efficiently learn from large-scale instances. Approaches such as sparse attention mechanisms and hierarchical models show promise in reducing computational complexity, making them valuable directions for future investigation. For applications to practical benchmark instances, such as those in the TSPLIB benchmark, the model performance often lags behind its success on synthetic datasets. Techniques like fine-tuning and domain adaptation could improve generalization to these applications. However, effective strategies for adapting the model to such instances remain underexplored and represent an important avenue for future research.
Supporting information
Data. The accompanying ZIP package includes data used in this research.
https://doi.org/10.1371/journal.pone.0319711.s001
(ZIP)
References
- 1. Bengio Y, Lodi A, Prouvost A. Machine learning for combinatorial optimization: a methodological tour d’horizon. Eur J Oper Res. 2021;290(2):405–21.
- 2. Song H, Triguero I, Ozcan E. A review on the self and dual interactions between machine learning and optimisation. Prog Artif Intell. 2019;8(2):143–165.
- 3. Li K, Zhang T, Wang R, Qin W, He H, Huang H. Research reviews of combinatorial optimization methods based on deep reinforcement learning. ACTA Automat Sinica. 2021;47(11):2521–37.
- 4. Choong S, Wong L, Lim C. An artificial bee colony algorithm with a modified choice function for the traveling salesman problem. Swarm Evolution Comput. 2019;44:622–35.
- 5. Vinyals O, Fortunato M, Jaitly N. Pointer networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. vol. 2. 2015. p. 2692–700.
- 6. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. vol. 2. 2014. p. 3104–12.
- 7. Cho K, Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. p. 1724–34.
- 8. Bello I, Pham H, Le Q, Norouzi M, Bengio S. Neural combinatorial optimization with reinforcement learning. arXiv preprint. 2016. http://arxiv.org/abs/1611.09940
- 9. Feiyu Z, Dayan L, Zhengxu W, Jianlin M, Niya W. Autonomous localized path planning algorithm for UAVs based on TD3 strategy. Sci Rep 2024;14(1):763. pmid:38191590
- 10. Nazari M, Oroojlooy A, Takac M, Snyder LV. Reinforcement learning for solving the vehicle routing problem. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2018. p. 9186–871.
- 11. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80. pmid:9377276
- 12. Ma Q, Ge S, He D, Thaker D, Drori I. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. arXiv preprint. 2019.
- 13. Bera A, Wharton Z, Liu Y, Bessis N, Behera A. SR-GNN: spatial relation-aware graph neural network for fine-grained image categorization. IEEE Trans Image Process. 2022;31:6017–31.
- 14. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2017. p. 6000–10.
- 15. Islam S, Elmekki H, Elsebai A, Bentahar J, Drawel N, Rjoub G, et al. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst Appl. 2024;241:122666.
- 16. Deudon M, Cournut P, Lacoste A, Adulyasak Y, Rousseau LM. Learning heuristics for the TSP by policy gradient. In: Integration of constraint programming, artificial intelligence, and operations research. Springer; 2018. p. 170–81.
- 17. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8(2):229–56.
- 18. Kaempfer Y, Wolf L. Learning the multiple traveling salesmen problem with permutation invariant pooling networks. arXiv preprint. 2018. http://arxiv.org/abs/1803.09621
- 19. Kool W, van Hoof H, Welling M. Attention, learn to solve routing problems! In: International Conference on Learning Representations. 2019.
- 20. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2024. Available from: https://www.gurobi.com
- 21. Helsgaun K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Roskilde: Roskilde University; 2017.
- 22. Cook WJ, Applegate DL, Bixby RE, Chvatal V. Concorde TSP Solver. 2024. Available from: https://www.math.uwaterloo.ca/tsp/concorde/index.html
- 23. Bresson X, Laurent T. The transformer network for the traveling salesman problem. arXiv preprint. 2021.
- 24. Dai H, Khalil EB, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2017. p. 6351–61.
- 25. Manchanda S, Mittal A, Dhawan A, Medya S, Ranu S, Singh A. Learning heuristics over large graphs via deep reinforcement learning. arXiv preprint. 2019. http://arxiv.org/abs/1903.03332
- 26. Li Z, Chen Q, Koltun V. Combinatorial optimization with graph convolutional networks and guided tree search. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2018. p. 537–46.
- 27. Nowak A, Villar S, Bandeira A, Bruna J. Revised note on learning algorithms for quadratic assignment with graph neural networks. arXiv preprint. 2017. http://arxiv.org/abs/1706.07450
- 28. Joshi C, Laurent T, Bresson X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint. 2019. http://arxiv.org/abs/1906.01227
- 29. Gao L, Chen M, Chen Q, Luo G, Zhu N, Liu Z. Learn to design the heuristics for vehicle routing problem. arXiv preprint. 2020. http://arxiv.org/abs/2002.08539
- 30. Lei K, Guo P, Wang Y, Wu X, Zhao W. Solve routing problems with a residual edge-graph attention neural network. Neurocomputing. 2022;508:79–98.
- 31. Ying C, Cai T, Luo S, Zheng S, Ke G, He D, et al. Do transformers really perform badly for graph representation? In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2021. p. 28877–88.
- 32. Mao G, Zhang N. A fast method to find nodes with the highest closeness centrality ranking in directed BA networks. In: 2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT). IEEE; 2022. p. 1–4.
- 33. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning. 2015. p. 448–56.
- 34. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
- 35. Nilsson C. Heuristics for the traveling salesman problem. Linköping: Linköping University; 2003.
- 36. Wong L, Khader A, Al-Betar M, Tan T. Solving asymmetric traveling salesman problems using a generic bee colony optimization framework with insertion local search. In: 2013 13th International Conference on Intelligent Systems Design and Applications. IEEE; 2013. p. 20–7.
- 37. Choong S, Wong L, Low M, Chong C. A bee colony optimisation algorithm with a sequential-pattern-mining-based pruning strategy for the travelling salesman problem. Int J Bio-Inspired Comput. 2020;15(4):239–53.
- 38. Luo F, Lin X, Liu F, Zhang Q, Wang Z. Neural combinatorial optimization with heavy decoder: toward large scale generalization. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2023. p. 8845–64.
- 39. Xu Y, Fang M, Chen L, Du Y, Xu G, Zhang C. Shared dynamics learning for large-scale traveling salesman problem. Adv Eng Inform. 2023;56:102005.
- 40. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529–33. pmid:25719670
- 41. Hsu H, Lachenbruch P. Paired t test. In: Wiley StatsRef: Statistics Reference Online. 2014.
- 42. Rennie S, Marcheret E, Mroueh Y, Ross J, Goel V. Self-critical sequence training for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 7008–24.
- 43. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint. 2014. http://arxiv.org/abs/1412.6980
- 44. Clarke G, Wright J. Scheduling of vehicles from a central depot to a number of delivery points. Oper Res. 1964;12(4):568–81.
- 45. Da Costa P, Rhuggenaath J, Zhang Y, Akcay A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian Conference on Machine Learning. PMLR; 2020. p. 465–80.
- 46. Held M, Karp RM. The traveling-salesman problem and minimum spanning trees. Oper Res. 1970;18(6):1138–62.
- 47. Joshi C, Cappart Q, Rousseau L, Laurent T. Learning the travelling salesperson problem requires rethinking generalization. Constraints. 2022;27(1):70–98.
- 48. Cheng H, Zheng H, Cong Y, Jiang W, Pu S. Select and optimize: learning to solve large-scale TSP instances. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 1219–31.
- 49. Yang H, Zhao M, Yuan L, Yu Y, Li Z, Gu M. Memory-efficient transformer-based network model for traveling salesman problem. Neural Netw. 2023;161:589–97. pmid:36822144
- 50. Fu Z, Qiu K, Zha H. Generalize a small pre-trained model to arbitrarily large TSP instances. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35:7474–82.