
Optimization of shunting operation plan in large freight train depot based on DQN algorithm

  • Jiandong Qiu,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation School of Mechanical Engineering, Lanzhou Jiaotong University, Lanzhou, China

  • Shusheng Xu,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft

    Affiliation School of Mechanical Engineering, Lanzhou Jiaotong University, Lanzhou, China

  • Minan Tang ,

    Roles Funding acquisition, Investigation, Methodology, Supervision, Validation, Visualization, Writing – review & editing

    tangminan@mail.lzjtu.cn

    Affiliation School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou, China

  • Jiaxuan Liu,

    Roles Formal analysis, Investigation, Resources, Validation

    Affiliation School of Mechanical Engineering, Lanzhou Jiaotong University, Lanzhou, China

  • Hailong Song

    Roles Formal analysis, Investigation, Validation

    Affiliation School of Mechanical Engineering, Lanzhou Jiaotong University, Lanzhou, China

Abstract

The shunting operation plan is the main daily work of a freight train depot, and its optimization is of great significance for improving the efficiency of railway operation, production, and transportation. In this paper, the deep reinforcement learning (DRL) environment and model of the shunting operation problem are constructed from three elements: action, state, and reward. The shunting locomotive is taken as the agent, the lane number of the fall-down train group as the action, and the fall-down condition of the train groups as the state, and the reward function is designed based on the total number of shunting hooks generated after the groups fall down and are reorganized. The model is solved using the Deep Q-Network (DQN) algorithm with the objective of minimizing the number of shunting hooks, and the optimal shunting operation plan can be obtained after sufficient training. DQN is verified to be effective through example simulations. Compared with the overall planning and coordinating (OPC) method, DQN produces a shunting operation plan that occupies fewer lanes and generates 10% fewer total shunting hooks. Compared with the binary search tree (BST) algorithm, DQN produces 5% fewer total shunting hooks. Compared with the branch and bound (B&B) algorithm, DQN takes less time to solve, the numbers of freight trains moved by the coupling and slipping operations are reduced by 5.3% and 2.9%, respectively, and the quality of the shunting operation plan is better. Therefore, this paper provides a new solution for the intelligentization of shunting operations in large freight train depots.

Introduction

A large train depot has a huge annual volume of train overhauls, many lines in the section, and a large number of freight trains in stock, and as shunting operations proceed, the storage locations of the trains are in a state of constant dynamic change [1]. Selecting the trains to be repaired from among the many vehicles on the many tracks, in accordance with the daily overhaul requirements, demands a large number of shunting operations. At present, the shunting operation plan is basically prepared manually, according to the experience of the planner, and there is no reliable standard for judging whether room for optimization remains. The problems brought about by the large volume of shunting operations, such as low productivity, high operating costs, and increased safety risks [2], are very prominent, and shunting has become a safety-critical point in the train depot.

Many scholars have conducted in-depth research on the shunting operation planning problem. Song et al. [3,4] put forward the concepts of the locating train group, the must-adjustable train group, and the passing-adjustable train group, and, starting from determining the position of the locating train group of each station in the train to be organized, discussed methods for determining the whereabouts of the train groups and the adjustable train groups in the computerized preparation of train-grouping hook plans. For cases where the number of fall-down trains is not an integer power of 2, Song [5,6] used an analytical calculation method to propose quantitative criteria for evaluating the shunting hook plan of pick-and-hook trains, i.e., converting the number of slipping hooks, effectively using redundant counterparts, and optimizing the screening of shunting plans. Gao et al. [7] proposed the "elimination of inverse order" method, which optimizes the shunting operation through the two phases of "inverse substitution" and "positive sequence establishment"; this method is especially effective when the number of lines is limited. Based on the principle of shunting operation, Wang et al. [8] abstracted the train fall-down problem into a sequencing problem and proposed a binary-tree-based planning method, which uses a BST to obtain an ordered sequence of trains as a candidate set of fall-down schemes and then screens it against selection conditions to obtain the optimal fall-down scheme. Shi et al. [9] proposed a two-level lexicographic goal programming model that considers flexible storage and track occupancy, which improves the quality of shunting operation planning for train depots under limited facility conditions.
With the objective of reducing lane occupancy during shunting operations and taking multiple constraints into consideration, Hu et al. [10] designed a simulated annealing algorithm based on generating feasible shunting paths and exchanging operation priorities, which can obtain an optimized shunting operation plan in a relatively short time. Chen et al. [11] pointed out the strong coupling between the collaborative preparation of the technical operation plan of a high-speed railway hub station and the shunting operation plan of the depot, and designed a hybrid optimization algorithm combining bottleneck processes, heuristic allocation rules, and a parallel tabu search algorithm. Zhang et al. [12] addressed the efficient preparation of shunting operation plans for picking up trains by station, using a 0-1 linear programming model and a heuristic branch-and-bound algorithm based on the inverse-order elimination rule.

With the wave of technological innovation, the demand for intelligence in railway shunting work is growing, and traditional optimization algorithms have gradually proven inadequate to meet it [13]. The emergence of reinforcement learning (RL) brings new solutions and ideas for railway shunting work: by continuously optimizing the decision-making process through interactive learning between the agent and the environment, it pushes railway shunting toward greater efficiency and intelligence.

RL is an important branch of machine learning [14,15], based on the idea that an agent learns optimal behavioral strategies through interaction with its environment. DRL combines deep learning (DL) with RL, enabling an agent to learn complex strategies through interaction with the environment and thus solve problems in high-dimensional and continuous action spaces. The rise of DRL can be traced back to the proposal of DQN, a milestone in the field, as that work demonstrated for the first time that DL can be successfully applied to RL tasks. The DQN algorithm solves the scaling problem of traditional Q-learning algorithms [16,17] in high-dimensional state spaces by estimating the Q-value with a deep neural network. DQN was first proposed by the DeepMind team in 2013 [18] and improved in a 2015 paper [19], which reduced the dependency between the computation of the target Q-value and the parameters of the Q-network being updated by using two Q-networks; that work applied DQN to Atari 2600 games and showed it surpassing human players.

In railway shunting, Hirashima [20,21] proposed a Q-learning based RL approach aimed at solving the problem of shifting and dispatching freight cars in a train through autonomous learning. Šemrov et al. [22] proposed a Q-learning based train re-scheduling method, which can efficiently adjust the train operation plan to reduce delays and improve the reliability of the railway system in the event of major disruptions to train operations. References [23,24] mainly used DRL methods to solve the train unit shunting problem faced by Dutch Railways. Shi et al. [25] proposed an optimization method combining the table shunting method, RL, and the Q-learning algorithm, which can obtain a good approximation of the shunting operation plan in a shorter time than traditional methods.

In summary, the development of artificial intelligence technology, especially the application of RL to optimization problems, provides a new way to improve the intelligence level of shunting operation planning. The DQN algorithm, as a leading RL method, has already demonstrated its powerful optimization ability in many fields. However, when directly applied to the optimization of shunting operation plans for large freight train depots, the DQN algorithm faces challenges such as a huge state space, low learning efficiency, and weak strategy generalization ability. To address these issues, this paper proposes a DQN algorithm tailored to the shunting operation characteristics of large freight train depots, aiming to improve the optimization quality of the shunting operation plan by enhancing the exploration capability and learning efficiency of the algorithm. By introducing an experience replay mechanism, a target network, and prioritized experience replay, the improved algorithm in this paper can learn more stable and effective shunting strategies in complex environments. Fig 1 shows a schematic diagram of the shunting operation plan generation process.

Fig 1. The process of generating a shunting operation plan.

Considering the shunting yard as the environment with which the reinforcement learning model interacts, the agent learns continuously through interaction with the environment and finally outputs the shunting operation plan.

https://doi.org/10.1371/journal.pone.0320762.g001

In this paper, DRL technology is applied to the preparation of shunting operation plans for freight trains. The shunting locomotive is selected as the agent, an RL model is constructed through the three elements of action, state, and reward, and a neural network is added to the traditional Q-learning algorithm to replace its Q-table. By introducing the two key techniques of experience replay and fixed Q-targets, the deep neural network approximates the state-action value function (Q-value), and the optimal shunting operation plan is obtained when the Q-value converges. The algorithm can generate the shunting plan optimally according to the current status of the freight stock in the train depot and the day's maintenance plan, reducing the number of shunting operations and thereby reducing cost, improving efficiency, and ensuring safety. The main contributions of this paper include:

  1. (1) Given the large number of trains stored in a large train depot and the large number of lanes, this paper proposes a DQN algorithm suited to the characteristics of shunting operations in large freight train depots, by reasonably setting the action and state spaces of the environment and designing a reward mechanism suited to shunting operation optimization, in order to guide the algorithm to learn a more economical and efficient shunting plan.
  2. (2) As verified by simulation experiments, the algorithm in this paper improves performance in shunting operation plan optimization compared with OPC, BST, and B&B, providing a new solution and theoretical support for the intelligent scheduling of railway freight train depots.

The structure of this paper is as follows: Section 2 describes the process of preparing the shunting operation plan. Section 3 establishes a DRL model in conjunction with the problem. Section 4 solves the problem using DQN algorithms. Section 5 analyzes several cases with arithmetic examples. The last section summarizes the paper.

Description of the problem

For the trains to be overhauled that are staying on the storage line of the train depot, the disordered and random groups of trains are organized into groups of a specific order in accordance with the grouping order of the overhaul plan. The process of group fall-down can be carried out using the shunting table, as shown in Table 1. The rows in the shunting table represent the lanes that can be used to park the units, and the columns in the table represent the mutual positions of the units in the train to be organized. It is assumed that the lead track is located on the left side of the shunting yard, and the left side of the freight train is the front and the right side is the rear. The shunting locomotive operates at the left end of the freight train, and connects the freight trains according to the order of the station. The order of the wagons is increasing from left to right.

Connection: starting from the front of the train, the numbers of neighboring groups increase or are equal, and the difference is not greater than 1.

Non-connection: neighboring train groups that do not satisfy the connection condition.

Temporary joint sequence: a sequence in which at least one pair of neighboring train groups is in non-connection form.

For example, consider a train to be organized whose groups are numbered '1,2,3,5,4,6,7'. The group numbered 1 is adjacent to the group numbered 2; the numbers are in increasing order and differ by no more than 1, so the two groups form a connection. The group numbered 3 is adjacent to the group numbered 5; the numbers are in increasing order, but they differ by 2 (greater than 1), so the two groups form a non-connection. The group numbered 5 is adjacent to the group numbered 4, but the numbers are not in increasing order, so these two groups also form a non-connection. Since non-connection forms exist in this train, the train is a temporary joint sequence.
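The adjacency rule above can be captured in a few lines of code. This is a minimal sketch of the stated condition; the helper names `is_connection` and `count_non_connections` are our own, not from the paper:

```python
def is_connection(left: int, right: int) -> bool:
    """Adjacent groups form a connection when the right-hand number is
    equal to the left-hand one or exactly one greater (increasing or
    equal, difference not greater than 1)."""
    return right == left or right - left == 1

def count_non_connections(train: list[int]) -> int:
    """Count adjacent pairs violating the connection condition; a train
    containing any non-connection is a temporary joint sequence."""
    return sum(not is_connection(a, b) for a, b in zip(train, train[1:]))

# Pairs (3,5) and (5,4) from the worked example are non-connections:
assert not is_connection(3, 5)   # increasing, but the difference is 2
assert not is_connection(5, 4)   # not increasing
assert count_non_connections([1, 2, 2, 3]) == 0   # a fully connected train
```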

Adjustment of the group order is realized through the shunting locomotive. The process by which the shunting locomotive attaches a group of trains or detaches a group of trains is called a shunting hook, divided into attaching hooks and detaching hooks. The number of shunting hooks directly affects the efficiency of shunting operations, so the essence of optimizing the shunting operation plan is to reduce shunting time and improve operational efficiency by arranging the shunting hooks so that their number is minimized. Fig 2 shows the schematic diagram of attaching and detaching hooks in lead track shunting.

Fig 2. Schematic diagram of lead track shunting.

The shunting locomotive is shown in blue and the group of cars to be organized is shown in red. The figure shows the process of completing a pull-out line shunting.

https://doi.org/10.1371/journal.pone.0320762.g002

Deep reinforcement learning modeling

The shunting operation plan is divided into two parts, fall-down and reorganization, and the RL model is constructed by setting up three elements: action, state, and reward.

Parameter definitions

Some of the variables in the model developed in this paper are explained in Table 2.

Reorganization process after the fall-down

After the fall-down is complete, the trains in the shunting yard are reorganized to meet the requirement of grouping by station. The reorganization is realized by attaching and detaching the fallen train groups. The attaching and detaching conditions and the reorganization process are as follows [25].

(1) Attaching conditions

Condition 1: Both and its left end group can form a connecting form with the group attached to the shunting locomotive and there is no group with a larger number than in the lane.

Condition 2: Search for the temporary joint sequence in descending order of the number of disjunctions, where there exists or , and i ≠ m.

Condition 3: or , and i ≠ m, exists in all connecting form groups.

When any of the above conditions for attaching trains are met, attach and its left train group to the shunting locomotive.

(2) Detaching conditions

Condition 4: There exists such that .

Condition 5: There exists such that .

When any of the above conditions for detaching trains is met, the rightmost train group connected to the shunting locomotive is slipped to the corresponding lane.

(3) Train group reorganization process

The reorganization process of the train groups is shown in Fig 3. After all train groups have fallen, the remaining groups are first checked, in the order of the shunting table, against attaching conditions 1-3 to decide whether to attach trains. It is then determined whether the train groups on the shunting locomotive form a connection, whether no train group in the shunting table has a larger number than E, and whether train groups remain in the shunting table; the attaching conditions continue to be checked until they are no longer met. Next, detaching conditions 4-5 are checked to decide whether to detach, and train groups are slipped until all groups on the shunting locomotive form a connected form and no group in the lane has a larger number than E. If no attaching or detaching operation has occurred in this round, train group E is slipped at random to any lane, and the attaching and detaching conditions are checked again. The above process is repeated until no train group remains in the shunting table; the shunting operation schedule is then obtained, together with the numbers of attaching and detaching hooks derived from it.

Fig 3. Reorganization process after train group falling.

The blue color is for attaching conditions and the red color is for detaching conditions.

https://doi.org/10.1371/journal.pone.0320762.g003

Action and state of RL

The agent is the subject of RL. This paper takes the shunting locomotive as the agent, the shunting yard as the environment, the fall-down situation of all train groups in the shunting yard as the current state, and the lane number to which the current train group is to fall as the action. The agent executes actions to interact with the environment and thereby changes the environment's state.

An N-dimensional row vector is used to represent the agent's action on the nth train group (1 ≤ n ≤ N), with the nth component recording the lane number selected by the agent. The remaining train groups do not execute a falling action, so the remaining components are 0. For example, if the 2nd train group falls to lane number 1, the 2nd component is 1. If there are L actions (the number of lanes available for shunting) that the agent can select, this defines the set of actions a train group can execute.

Because the shunting yard serves as the environment and can be abstracted into a shunting table representing the fall-down situation of the train groups, the shunting table is simplified as an N-dimensional row vector representing the state of the table after the nth train group has fallen; each component records the lane number where the corresponding train group is located after its fall. Before any train group has fallen, the initial state of the shunting table is the zero vector. When the nth train group completes its fall, the nth component changes. For example, after the second train group falls onto lane number 1, the 2nd component of the state becomes 1. After the agent executes the action, the state is updated according to the following formula:

(1)

Where the first term represents the updated state, the second represents the state before the update, and the third represents the action taken for the nth train group.
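Under these definitions, the state update of Eq (1) amounts to adding an action vector, whose only nonzero component is the chosen lane number, to the state vector. A minimal sketch (the dimensions N and L and the function name `fall` are illustrative assumptions):

```python
N = 7   # number of train groups (illustrative)
L = 5   # number of available lanes (illustrative)

def fall(state: list[int], n: int, lane: int) -> list[int]:
    """Drop the n-th train group (1-indexed) onto the given lane.
    The action is an N-dimensional vector whose only nonzero component
    is the chosen lane number; Eq (1) adds it to the previous state."""
    action = [0] * len(state)
    action[n - 1] = lane
    return [s + a for s, a in zip(state, action)]

state = [0] * N                 # initial state: no group has fallen yet
state = fall(state, 2, 1)       # the 2nd group falls to lane 1
assert state == [0, 1, 0, 0, 0, 0, 0]
```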

Reward function design

The merit of the shunting operation plan is mainly measured by the number of shunting hooks. When the Q-value accumulated from the rewards obtained after the agent executes actions converges stably to a maximum, the algorithm has found the optimal shunting plan. The reward function is designed based on the connection status after the current train group has fallen and on the total number of shunting hooks generated once all falls are complete. The reward is divided into two parts: an immediate reward and a delayed reward.

The immediate reward is determined by whether the current train group, after completing its fall, forms a connection with the train group already on the lane. The immediate reward is as follows:

(2)

Where, σ is a 0-1 variable, which takes the value of 1 when the current train group completes its fall and forms a connected state with the train group on the lane, and takes the value of 0 otherwise.

After the train group completes its fall and reorganization, the delay reward is determined based on the total number of shunting hooks generated [26,27]. The delay reward is as follows:

(3)

Where, λ is any positive number.

The immediate and delayed rewards are combined to form the cumulative reward obtained by the agent after executing the action . The cumulative reward is as follows:

(4)

Where, is the cumulative reward received by the agent for executing the action while in state . i is a natural number from 0 to n.
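Since the reward formulas in Eqs (2)-(4) were rendered as images in the original, the following is only one plausible reading of the verbal description: the immediate part rewards forming a connection (via σ), the delayed part penalizes the total hook count scaled by λ, and the two are combined additively. All function names and the exact scaling are assumptions:

```python
def immediate_reward(sigma: int) -> int:
    """sigma is 1 when the fallen group forms a connected state with the
    group already on the lane, 0 otherwise (verbal rule for Eq (2))."""
    return sigma

def delayed_reward(total_hooks: int, lam: float = 1.0) -> float:
    """Fewer shunting hooks after reorganization -> larger reward;
    lam is any positive number, as stated for Eq (3)."""
    return -lam * total_hooks

def cumulative_reward(sigmas: list[int], total_hooks: int,
                      lam: float = 1.0) -> float:
    """Combine immediate and delayed parts as in Eq (4) (assumed additive)."""
    return sum(immediate_reward(s) for s in sigmas) + delayed_reward(total_hooks, lam)
```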

Problem solving

The expectation of the reward the agent can obtain after executing an action is represented by the Q-value; the larger the Q-value, the better the current strategy. When solving the model, a mapping between the train and the optimal shunting operation plan is constructed with the goal of minimizing the number of shunting hooks. The completion of the fall of all train groups to be assembled is recorded as one episode. In each episode, the nth train group executes an action to complete its fall, and after the action is executed, a cumulative reward is obtained. The Q-value is updated based on the cumulative reward, according to the following formulas:

(5)(6)

Here, the formula gives the updated Q-value of the agent after executing the action in the current state; the action is determined under the guidance of the policy π ( a | s ), which represents the probability of the agent choosing action a in state s. β is the learning rate: the higher the learning rate, the greater the weight given to the results of new trials, and vice versa. γ is the discount factor: if γ is closer to 0, the agent is more inclined toward immediate rewards; if γ is closer to 1, the agent gives more consideration to future rewards. ε is the exploration rate of the agent, and F is the number of episodes. argmax ⁡ Q ( s , a ) is the action that maximizes the Q-value in state s. At the beginning, when the number of episodes is low, ε is high, which allows substantial exploration and learning. As the agent learns about future rewards, ε decays, which facilitates the discovery of higher Q-values.
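In tabular form (before the neural network is introduced), the update in Eqs (5)-(6) and the ε-greedy policy can be sketched as follows; the function names and the decay schedule are illustrative, not the paper's actual settings:

```python
import random

def epsilon_greedy(q_row: list[float], epsilon: float) -> int:
    """With probability epsilon explore a random action; otherwise
    exploit the action with the largest Q-value (argmax Q(s, a))."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

def q_update(q: dict, s, a, r, s_next,
             beta: float = 0.1, gamma: float = 0.9) -> None:
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q[s_next])
    q[s][a] += beta * (target - q[s][a])

def decay_epsilon(epsilon: float, rate: float = 0.995,
                  floor: float = 0.01) -> float:
    """The exploration rate decays as the number of episodes grows."""
    return max(floor, epsilon * rate)
```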

The traditional Q-learning algorithm stores the Q-values in a Q-table and updates them according to formulas (5)-(6); this is the learning process of the agent. The DQN algorithm replaces the Q-table with a neural network, as shown in Fig 4 [28]. However, directly combining the two leads to two obvious problems:

Fig 4. Comparison of Q-learning and DQN.

A schematic of the Q-learning algorithm is shown at the top, and a schematic of the DQN algorithm is shown at the bottom.

https://doi.org/10.1371/journal.pone.0320762.g004

(1) Neural networks require that the input samples be independent of each other, unrelated, and satisfy the independent and identically distributed (i.i.d.) condition. However, the states input in RL are interrelated and do not meet the i.i.d. condition.

(2) The introduction of nonlinear functions, using neural networks to approximate the Q-table, may lead to non-convergence of the training results. For example, in regression problems, if there is correlation between the input data, it may cause the function fitted by the network to change, resulting in inaccurate predictions and a high loss.

The following two major improvements alleviate the problem of network convergence difficulties when input data do not meet the independent and identically distributed (i.i.d.) condition.

(1) Experience Replay

The memory in experience replay is used to store past experiences. Since Q-learning is an off-policy, offline learning method, it can learn from current experiences, past experiences, and even the experiences of others. Therefore, incorporating previous experiences randomly during the learning process can make the neural network more efficient.

Thus, experience replay addresses the problems of correlation and non-stationary distribution. The transition samples obtained from the agent's interaction with the environment are stored in the replay memory [29]; during training, a batch is drawn at random, which breaks the correlation within the data. Experience replay also fully exploits the advantages of off-policy learning: the behavior policy is used to collect experience data, while the target policy focuses solely on value maximization.
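A replay memory of this kind can be sketched in a few lines; the class name and capacity are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next) transitions; sampling
    uniformly at random breaks the temporal correlation between
    consecutive interaction steps."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=10_000)
for t in range(100):
    memory.push((t, 0, 0.0, t + 1))   # dummy transitions
batch = memory.sample(32)             # random batch for one training step
```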

(2) Fixed Q-targets

The Q-target serves as a mechanism to break correlation. With Q-targets, the DQN algorithm contains two networks with the same structure but different parameters: the Q-network uses the most up-to-date parameters, while the target network uses earlier ones. The output of the current Q-network evaluates the value function of the current state-action pair; the output of the target network is used to calculate the target Q-value, based on which the parameters of the Q-network are updated through the loss function. After a certain number of episodes, the parameters of the Q-network are copied to the target network. For example, one can train the Q-network for 10 episodes, assign the updated parameters to the target network, train the Q-network for another 10 episodes, and repeat this process. After introducing the target network, the target Q-value remains constant for a period of time, which reduces the correlation between the current Q-value and the target Q-value and enhances the stability of the algorithm.
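The periodic parameter copy can be sketched as follows. A plain parameter dictionary stands in for the network here to keep the sketch dependency-free; in the paper's PyTorch setting the copy would be `target.load_state_dict(eval.state_dict())`. The class name, update size, and sync interval are illustrative assumptions:

```python
import copy

class QNetwork:
    """Stand-in for a neural network: just holds a parameter dict."""
    def __init__(self):
        self.params = {"w": 0.0}

q_eval = QNetwork()
q_target = copy.deepcopy(q_eval)       # same structure, frozen parameters

SYNC_EVERY = 10                        # copy interval (illustrative)
for episode in range(1, 31):
    q_eval.params["w"] += 0.1          # pretend training updates q_eval
    if episode % SYNC_EVERY == 0:
        # fixed Q-targets: refresh the target network only periodically
        q_target.params = copy.deepcopy(q_eval.params)
```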

The Q-value update formula and policy are the same as the Q-learning formulas (5)-(6) above, with the neural network parameters added to them [19]; the formulas are as follows [30,31]:

(7)(8)(9)

The loss function of the DQN is expressed as follows [32]:

(10)

Stochastic gradient descent is used to update the neural network parameters . The gradient of the loss function is as follows:

(11)
Fig 5. Flowchart of DQN algorithm.

The agent generates data by interacting with the environment; the data obtained are stored in the replay memory and used to train the neural network.

https://doi.org/10.1371/journal.pone.0320762.g005

As shown in Fig 5, the agent first continuously interacts with the environment to obtain interaction data, which are stored in the replay memory. When there is enough data in the experience pool, a batch of data is randomly drawn. The predicted Q-value is calculated using the current network, and the target Q-value is calculated using the target network. Then the loss between the two is computed, and gradient descent is used to update the current network parameters. After this is repeated several times, the parameters of the current network are copied to the target network.

Processing techniques such as experience replay and fixed Q-targets have been added to the DQN algorithm [33]. The DQN algorithm design steps are as follows:

Step 1: Initialize the memory D, initialize the Q_eval_net (current data), and initialize the Q_target_net (historical data).

Step 2: Preprocess the environment and input the state s into the DQN.

Step 3: Select an action using an ε-greedy policy: with probability ε, choose a random action, and with probability 1 − ε, select the current optimal action based on the model, i.e., the action that maximizes the Q-value predicted by the network with its current parameters.

Step 4: The train group executes the selected action a in state s, and transitions to a new state with the reward .

Step 5: Store the state information in the memory D, denoted as .

Step 6: Randomly select a batch of samples from D. In Q_target_net, calculate the target Q-value (Q_target) for the next state using the Q-learning algorithm. In Q_eval_net, calculate the Q-value (Q_eval) for the current state. Use the mean squared error between the two as the loss and execute gradient descent to minimize it.

Step 7: Replace the parameters of Q_target_net at regular intervals.

Step 8: Repeat steps 2 to 7 for M rounds.
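Steps 1-8 can be sketched in PyTorch (the framework the experiments use); the network size, dimensions, and hyperparameters below are illustrative assumptions, not the paper's actual settings:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the state is an N-dim lane vector, one action per lane.
N_STATE, N_ACTIONS = 7, 5

def build_net() -> nn.Module:
    # Small fully connected network approximating Q(s, a).
    return nn.Sequential(
        nn.Linear(N_STATE, 64), nn.ReLU(),
        nn.Linear(64, N_ACTIONS),
    )

q_eval_net = build_net()                           # current parameters (Step 1)
q_target_net = build_net()                         # historical parameters
q_target_net.load_state_dict(q_eval_net.state_dict())

optimizer = torch.optim.Adam(q_eval_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
GAMMA = 0.9

def train_step(states, actions, rewards, next_states) -> float:
    """One gradient step on a sampled batch (Step 6)."""
    # Q_eval: Q-values of the actions actually taken
    q_eval = q_eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # target network is frozen here
        q_next = q_target_net(next_states).max(dim=1).values
    q_target = rewards + GAMMA * q_next            # Q_target via Q-learning rule
    loss = loss_fn(q_eval, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A periodic `q_target_net.load_state_dict(q_eval_net.state_dict())` then implements Step 7.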

Case analysis

To verify the effectiveness of the algorithm, the following case study was conducted. The experimental environment is as follows: CPU: Intel Core i5 2.50 GHz; IDE: PyCharm; development language and tools: Python 3.11.9 and PyTorch 2.3.1. Based on the current storage state of the freight trains in the depot and the daily maintenance plan, the optimized shunting plan is generated.

The following uses the track layout of a train depot in Lanzhou as an example to illustrate the functional division of a freight train depot. Generally speaking, most train depots are responsible only for depot repair work; only a few, whose equipment is sufficient and whose site space permits, also take on part of the station repair task.

The layout of the train depot is shown in Fig 6, in which a number of lane areas with different functions are planned inside the depot. D1-D4 is the depot repair operation area, which mainly carries out depot repair of trains and ensures that key components and overall performance are repaired and improved. Z1-Z3 is the station repair operation area, which mainly handles relatively minor problems that need to be repaired quickly during train operation. T is the girder adjustment area, whose core function is the precise adjustment of train girders to ensure structural stability. X is the tanker cleaning operation area, which removes any remaining toxic, flammable, or explosive gases or liquids, laying a solid foundation for subsequent safety inspection and repair. Area P is the shot blasting and painting operation area, responsible for shot blasting, polishing, and painting to beautify and protect the train compartments. Area R is the pre-maintenance operation area, which carries out preliminary inspection and pre-maintenance for every train entering the depot, detecting hidden dangers and formulating maintenance strategies in advance. S1-S7 is the storage area for trains, which mainly handles transferring, holding, and preparing trains in the depot. Area J serves as the access line of the train depot, ensuring the smooth entry and exit of trains.

Computational process

Taking the previously mentioned train to be assembled, ’4,6,1,4,6,1,2,5,3,2,3,1,4’, as an example, with 5 available lanes (L = 5), the changes in Q-value and loss value after the RL model has been fully trained are shown in Fig 7.

Fig 6. Schematic layout of a train depot in Lanzhou.

The left side is the storage line, and the right side is the maintenance area.

https://doi.org/10.1371/journal.pone.0320762.g006

Fig 7. Training curve chart.

(a) Q-value change curve during the training process. (b) Loss value change curve during the training process.

https://doi.org/10.1371/journal.pone.0320762.g007

Initially, the Q-values are zero and the agent’s exploration rate is high so that it can learn a variety of plans. As the number of training episodes increases, the Q-values rise continuously. Once the agent has learned sufficiently, the Q-values stabilize and reach convergence; at this point, the sequence of actions that maximizes the Q-value represents the optimal solution. After sufficient learning, the optimal shunting table obtained by the agent is shown in Table 3.
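The convergence behaviour described above can be reproduced on a toy problem. The sketch below uses tabular Q-learning rather than the paper’s PyTorch DQN, and a three-state chain environment standing in for the shunting yard; the environment, learning rate, and exploration-decay schedule are illustrative assumptions, not the authors’ settings.

```python
import random

# Toy 3-state chain MDP standing in for the shunting environment (assumed
# setup for illustration): action 1 moves one state to the right and earns
# a reward of 1 on reaching the goal; action 0 stays in place.
N_STATES, ACTIONS, GOAL = 3, (0, 1), 2

def step(state, action):
    """One environment transition; the episode ends at the goal state."""
    if action == 1:
        nxt = state + 1
        return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL
    return state, 0.0, False

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    eps = 1.0                                  # high initial exploration rate
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # temporal-difference update
            s = s2
        eps = max(0.05, eps * 0.99)            # decay exploration during training
    return Q

Q = train()
# After convergence, moving toward the goal dominates staying in every state.
print(all(Q[s][1] > Q[s][0] for s in range(GOAL)))
```

The same pattern (high initial exploration, gradual decay, stabilizing Q-values) is what the training curves in Fig 7 display, with the tabular Q replaced by a neural network in the actual DQN.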

Table 4. Comparison of shunting operation plans obtained by different algorithms 1.

https://doi.org/10.1371/journal.pone.0313772.t004

Effectiveness comparison

Through the following two cases, the shunting plans generated by the algorithm in this paper are analyzed and compared with those produced by OPC, BST, and B&B, thereby verifying the effectiveness and superiority of DQN.

Case 1.

Select the train ’4,2,5,3,1,2,4,4,4,6,2,7,7,5,6,1,8,6’ [8]; the shunting operation plans produced by OPC, BST and DQN are compared, with results shown in Table 4. Compared with OPC, with the same number of occupied lanes, the total number of hooks generated by DQN’s shunting operation is reduced by 2. Compared with BST, again with the same number of occupied lanes, the total is reduced by 1 hook. Under otherwise identical conditions, the proposed algorithm obtains a shunting operation plan with fewer hooks, so the plan is better.
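For intuition about how lane choice drives the hook count, the following sketch implements a simple greedy baseline (not any of the compared algorithms): each car is dropped on the lane whose top value is largest while not exceeding the car’s number, so every lane stays sorted, and cuts between consecutive cars sent to different lanes serve as a rough proxy for shunting hooks. The 6-lane setting is an assumption for illustration.

```python
def greedy_plan(cars, n_lanes):
    """Best-fit greedy: drop each car on the lane whose current top value is
    largest while still not exceeding the car's number, so every lane stays
    sorted and can later be pulled off in order; open a fresh lane otherwise."""
    lanes = [[] for _ in range(n_lanes)]
    assignment = []                      # lane index chosen for each position
    for c in cars:
        fits = [i for i, l in enumerate(lanes) if not l or l[-1] <= c]
        if not fits:
            raise ValueError("not enough lanes for this sequence")
        best = max(fits, key=lambda i: lanes[i][-1] if lanes[i] else -1)
        lanes[best].append(c)
        assignment.append(best)
    return lanes, assignment

def count_cuts(assignment):
    """Rough hook proxy: one cut whenever consecutive cars go to different lanes."""
    return sum(1 for i, a in enumerate(assignment)
               if i == 0 or assignment[i - 1] != a)

# Case 1 sequence from the text, with an assumed 6 available lanes.
cars = [4, 2, 5, 3, 1, 2, 4, 4, 4, 6, 2, 7, 7, 5, 6, 1, 8, 6]
lanes, asg = greedy_plan(cars, 6)
print(count_cuts(asg), [l for l in lanes if l])
```

Because this greedy rule looks only one car ahead, it generally needs more cuts than a learned policy; its value here is only as a quick sanity check on lane-order feasibility.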

Table 5. Comparison of shunting operation plans obtained by different algorithms 2.

https://doi.org/10.1371/journal.pone.0313772.t005

Case 2.

Select the train ’1,5,4,3,2,5,3,2,1,3’ [12]; the shunting operation plans produced by the traditional algorithms and by the proposed algorithm are compared, with results shown in Table 5. Compared with OPC, the plan generated by DQN produces fewer shunting hooks and occupies fewer lanes, and the numbers of trains removed by the coupling and slipping operations are reduced by 21.7% and 32.4%, respectively. Compared with B&B, the total number of shunting hooks is the same with the same number of occupied lanes, but the numbers of trains removed by the coupling and slipping operations are reduced by 5.3% and 2.9%, respectively. Moreover, although the experimental equipment used in this paper is inferior to that used for B&B (CPU: Intel Core i7 3.4 GHz), the solution time of DQN is far less than the 1076 s required by B&B.

Table 6 and Fig 8 summarize the results of Case 1 and Case 2, which demonstrate the superiority of DQN by comparing the different algorithms.

Fig 8. Comparison of different algorithms.

(a) Different algorithms for shunting hooks in Case 1. (b) Different algorithms for shunting hooks in Case 2. (c) The number of trains removed by different algorithms in Case 2 for the coupling and slipping operations.

https://doi.org/10.1371/journal.pone.0320762.g008

Discussion

Innovations

In this study, the proposed algorithm is compared and analyzed with the OPC, BST and B&B in generating shunting operation plans through two specific examples, aiming to explore the performance of this algorithm and its significance and value in practical applications.

The results show that DQN has significant advantages in the key indicators of shunting operations. First, the total number of hooks is reduced compared with both OPC and BST. This means that in actual shunting operations, using DQN can reduce the number of operation steps, improving efficiency and reducing labor and time costs. This advantage stems from DQN’s more optimized strategy for handling the train drop sequence and lane utilization: it plans train movements more precisely and avoids unnecessary shunting operations, thus reducing the total number of hooks.

In Case 2, compared with OPC, the algorithm not only performs better in the number of shunting hooks, but also achieves better results in the number of occupied lanes and in the numbers of trains removed by the coupling and slipping operations. This further shows that DQN has stronger optimization ability when jointly considering the various shunting factors, reducing the occupation and waste of resources while ensuring smooth operations and improving the overall efficiency of railroad shunting yards.

The comparison with B&B is equally revealing. Although the two are equal on some indicators, such as the total number of shunting hooks, DQN still achieves a reduction in the number of trains removed in both the coupling and the slipping operations. It is also worth noting that the solution time of DQN is significantly shorter even though the experimental equipment performs worse than that used for B&B. This indicates that DQN has a clear advantage in computational efficiency and can provide a high-quality shunting operation plan in a shorter time, which is vital for scenarios in actual railroad transportation that require rapid response and frequent adjustment of operation plans, effectively improving timeliness and flexibility.

Suggestions

This study also has some limitations. Although the selected cases are representative, they may not cover all possible shunting operation scenarios. Future research can expand the scope and diversity of the examples, including trains to be made up of different sizes, complex railroad yard layouts, and various special operational requirements, in order to verify the stability and adaptability of the algorithm more comprehensively.

In addition, although DQN shows superiority in the current comparison, as railroad transportation technology develops and new operational requirements emerge, other possible improvement directions and optimization strategies still deserve continued attention and research, so that the algorithm’s performance can keep improving, better adapt to future trends in railroad transportation, and provide stronger support for the efficient and safe operation of railroad shunting.

In conclusion, the in-depth case analysis shows the effectiveness and superiority of DQN in shunting operation plan generation, and also points out directions for future research: the method needs to be improved and optimized over a wider range of scenarios and a longer development horizon.

Conclusion

In this paper, an optimization method for the shunting operation plan of large freight train depots is proposed, combining the table shunting method with the DQN algorithm.

(1) A DQN algorithm is designed for the characteristics of shunting operations in large freight train depots, and a DRL model of the shunting operation problem is established by building an action and state space that conforms to the shunting environment and designing a suitable reward function; this transforms the shunting operation plan optimization problem into that of finding the optimal train-drop scheme. Taking the minimum number of shunting hooks as the objective, the mapping between the train and the optimal shunting operation is established, and the optimal shunting operation plan is obtained when the cumulative Q-value stabilizes and reaches convergence.
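The action/state/reward design in point (1) can be sketched as a minimal environment class. The state encoding, penalty values, and hook-counting rule below are simplified assumptions for illustration, not the authors’ exact implementation.

```python
class ShuntingEnv:
    """Minimal sketch of the MDP in point (1): the agent is the shunting
    locomotive, the action is the lane index receiving the next train group,
    the state is the drop position plus each lane's top car, and the reward
    penalizes extra hooks. Penalty values here are illustrative assumptions."""

    def __init__(self, cars, n_lanes):
        self.cars, self.n_lanes = list(cars), n_lanes
        self.reset()

    def reset(self):
        self.pos, self.last = 0, None
        self.lanes = [[] for _ in range(self.n_lanes)]
        return self.state()

    def state(self):
        return (self.pos, tuple(l[-1] if l else 0 for l in self.lanes))

    def step(self, lane):
        c = self.cars[self.pos]
        # a new hook is needed whenever consecutive cars go to different lanes
        reward = 0.0 if lane == self.last else -1.0
        # dropping onto a lane whose top exceeds c breaks the pull-out order
        if self.lanes[lane] and self.lanes[lane][-1] > c:
            reward -= 5.0                 # assumed out-of-order penalty
        self.lanes[lane].append(c)
        self.last, self.pos = lane, self.pos + 1
        return self.state(), reward, self.pos == len(self.cars)

# Roll out an arbitrary fixed policy on the example train from the text.
env = ShuntingEnv([4, 6, 1, 4, 6, 1, 2, 5, 3, 2, 3, 1, 4], 5)
done, total = False, 0.0
while not done:
    _, r, done = env.step(env.pos % env.n_lanes)
    total += r
print(total)
```

Maximizing the cumulative reward of such an environment is equivalent to minimizing the hook-related penalties, which is the objective the DQN agent learns.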

(2) The reasonableness and effectiveness of the algorithm are verified through simulation experiments. Compared with OPC and BST, DQN occupies fewer lanes and yields a better shunting operation plan. Compared with B&B, DQN solves the shunting operation plan in less time, and with the same number of shunting hooks the plan obtained by the proposed method is of better quality. The improved performance of the proposed method in shunting operation plan optimization provides a new solution and theoretical support for the intelligentization of shunting operations in large freight train depots.

References

  1. Kang F, Wang W. Condition-based maintenance and safety of railway wagons of CHN energy. China Railway. 2021;8:109–15.
  2. Lin B, Duan J, Wang J, Sun M, Peng W, Liu C, et al. A study of the car-to-train assignment problem for rail express cargos in the scheduled and unscheduled train services network. PLoS One. 2018;13(10):e0204598. pmid:30303993
  3. Song J. Algorithm of turning random-ordered wagon groups into ordered wagon groups. J Lanzhou Railway Inst. 1995;14(1):66–72.
  4. Group II, Operations Research Laboratory, Institute of Mathematics, Chinese Academy of Sciences. A preliminary study of mathematical methods in railway shunting problems. Acta Math Appl Sinica. 1978;1(1):91–105.
  5. Song J. Study on optimizing shunting operation plan schemes with analysis and calculation method. J China Railway Soc. 1999;21(3):11–7.
  6. Song J. Railway shunting operation plan. Beijing: China Railway Publishing House; 2000.
  7. Gao S, Zhang D. Elimination of inverse-order – A new model of shunting operation. J China Railway Soc. 2003;25(5):1–7.
  8. Wang Y, Xiao Y, Lei Y, Gui W. Automatic compilation method for marshalling coupler plan of trains detaching and attaching based on BST. China Railway Sci. 2012;33(3):116–22.
  9. Shi J, Li H, Cao H, Zheng Y. Shunting operation planning at EMU depots considering train flexible storage and position occupation. China Railway Sci. 2022;43(1):152–62.
  10. Hu ZA, Zheng L, Zhou S. Optimization of shunting operation plan at electric multiple units depot considering train position proper occupancy. J Southwest Jiaotong Univ. 2022;57(1):65–73.
  11. Chen T, Wang W, Lv H, Lv M, Liu X. Coordinative planning between technical operation of high-speed railway hub station and shunting operation of EMU depot. J China Railway Soc. 2020;42(4):17–26.
  12. Zhang B, Peng Q, Li L, Lu G. Train shunting operation scheduling with number of wagon moves oriented optimization. J China Railway Soc. 2020;42(3):11–20.
  13. Han L, Wang Z, Zhang C. Research on intelligent prediction method of shunting operation time in railway centralized traffic control system. Railway Stand Des. 2022;66(02):143–8.
  14. Hu J, Niu H, Carrasco J, Lennox B, Arvin F. Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning. IEEE Trans Veh Technol. 2020;69(12):14413–23.
  15. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60. pmid:26185243
  16. Watkins CJ. Learning from delayed rewards. Cambridge: King’s College; 1989.
  17. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8(3–4):279–92.
  18. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning. arXiv preprint. 2013;arXiv:1312.5602.
  19. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. pmid:25719670
  20. Hirashima Y. A new reinforcement learning system for train marshaling with selectable desired layout. IFAC Proc Vol. 2011;44(1):6976–81.
  21. Hirashima Y. A reinforcement learning system for transfer scheduling of freight cars in a train. In: Proceedings of the International Multiconference of Engineers and Computer Scientists (IMECS), Hong Kong, China; 2010.
  22. Šemrov D, Marsetič R, Žura M, Todorovski L, Srdic A. Reinforcement learning approach for train rescheduling on a single-track railway. Transp Res B: Methodol. 2016;86:250–67.
  23. Peer E, Menkovski V, Zhang Y, Lee W-J. Shunting trains with deep reinforcement learning. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2018. p. 3063–8. https://doi.org/10.1109/smc.2018.00520
  24. Lee W, Jamshidi H, Roijers D. Deep reinforcement learning for solving train unit shunting problem with interval timing. In: Dependable Computing – EDCC 2020 Workshops: AI4RAILS, DREAMS, DSOGRI, SERENE, Munich, Germany; 2020. p. 99–110. https://doi.org/10.1007/978-3-030-58462-7_9
  25. Shi J, Chen L, Lin B, Meng G, Xia S. Optimization of shunting operation plan for detaching and attaching trains based on Q-learning algorithm. China Railway Sci. 2022;43(1):163–70.
  26. Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge: MIT Press; 2018.
  27. Naeem M, Rizvi STH, Coronato A. A gentle introduction to reinforcement learning and its application in different fields. IEEE Access. 2020;8:209320–44.
  28. Ha V, Vinh V. Experimental research on avoidance obstacle control for mobile robots using Q-learning (QL) and deep Q-learning (DQL) algorithms in dynamic environments. Actuators. 2024;13(1):26.
  29. Park S, Han E, Park S, Jeong H, Yun I. Deep Q-network-based traffic signal control models. PLoS One. 2021;16(9):e0256405. pmid:34473716
  30. Liu Q, Zhai J, Zhang Z, Zhong S, Zhou Q, Zhang P, et al. A survey on deep reinforcement learning. Chinese J Comput. 2018;41(1):1–27.
  31. Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. AAAI. 2016;30(1).
  32. Hausknecht M, Stone P. Deep recurrent Q-learning for partially observable MDPs. In: Proceedings of the AAAI Conference on Artificial Intelligence, Arlington, Virginia, USA; 2015. p. 29–37. https://doi.org/10.48550/arXiv.1507.06527
  33. Ren Y. Research on optimal dispatching problem of railway shifting yard vehicles. Taiyuan: Taiyuan University of Technology; 2021.