Abstract
To optimize the autonomous navigation of a mobile robot when only partial knowledge of the environment is available, an improved Q-learning reinforcement learning algorithm based on prior knowledge is proposed to address the slow convergence and low learning efficiency of traditional Q-learning in mobile robot path planning. Prior knowledge is used to initialize the Q-values, guiding the agent to move toward the target with greater probability from the early stage of the algorithm and eliminating a large number of invalid iterations. The greedy factor ε is dynamically adjusted according to the number of times the agent successfully reaches the target position, which better balances exploration and exploitation and accelerates convergence. Simulation results show that the improved Q-learning algorithm converges faster and learns more efficiently than the traditional algorithm, which has practical significance for improving the efficiency of autonomous navigation of mobile robots.
Citation: Shi Z, Wang K, Zhang J (2023) Improved reinforcement learning path planning algorithm integrating prior knowledge. PLoS ONE 18(5): e0284942. https://doi.org/10.1371/journal.pone.0284942
Editor: Xiangjie Kong, Zhejiang University of Technology, CHINA
Received: November 12, 2022; Accepted: April 11, 2023; Published: May 4, 2023
Copyright: © 2023 Shi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper.
Funding: This work was funded by the Science and Technology Research Project of the Education Department of Hubei Province (Q20201804) and the Scientific Research Program of the Key Laboratory of Automotive Power Train and Electronics (Hubei University of Automotive Technology) (ZDK1202002). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
In recent years, with the rise of artificial intelligence, mobile robot technology has developed rapidly, and its application fields have become increasingly extensive. Realizing the autonomous navigation of a mobile robot in a given environment is the key to realizing its functions [1–3]. An important part of mobile robot research is path planning, i.e., searching for an optimal or nearly optimal collision-free path from the initial state to the target state according to a certain performance index [4]. Path planning can be either global or local. Global path planning depends on the robot's grasp of global knowledge of the operating environment; representative algorithms include the A* algorithm [5–7], the visibility graph [8–10], cell decomposition [11], and the Dijkstra algorithm [12]. Local path planning depends on real-time information from sensors; typical algorithms include the artificial potential field method [13], the genetic algorithm [14], the neural network method [15], and reinforcement learning (RL) [16]. RL has received much attention because it can find the optimal path through trial and error in an unknown environment, and it has also been applied in industrial fields such as flow-shop scheduling [11, 17].
Q-learning is one of the most widely and successfully applied RL algorithms in mobile robot path planning [18, 19], but it faces two challenges: (1) the RL agent searches blindly in the early stage of the algorithm, resulting in many invalid iterations; and (2) it is difficult to balance exploration and exploitation. Wen et al. [20] proposed initializing the Q-values based on fuzzy logic so that the agent no longer selects actions blindly in the early stage, which decreases invalid iterations and speeds up the algorithm. Viet et al. [1] solved the path planning problem with the Dyna-Q-learning algorithm, which combines the Dyna learning framework with the Q-learning algorithm. However, both approaches increase the complexity of the algorithm.
This paper proposes an improved Q-learning algorithm that: (1) uses prior knowledge to initialize the Q-values, such that the closer a state is to the target location, the larger its Q-value, so the agent searches toward the target in the early stage, eliminating a large number of invalid iterations; and (2) dynamically adjusts the greedy factor ε according to the number of successful arrivals at the target location. In the early stage of the algorithm, ε is large and the agent tends to explore the environment; ε is then gradually decreased so that the agent tends to exploit high-quality actions, better balancing exploration and exploitation. The improved algorithm is expected to overcome the shortcomings of the general Q-learning algorithm in mobile robot path planning, such as slow convergence, long iteration time, and difficulty in balancing exploration and exploitation.
The rest of this paper is organized as follows. RL and traditional Q-learning are introduced in section II. Section III proposes improvements to traditional Q-learning. Section IV presents experimental results and analysis, and section V presents our conclusions.
2. Related work
2.1 Reinforcement learning
RL is different from supervised and unsupervised learning. It realizes the optimal solution of sequential decision-making problems through an agent interacting with the environment. An RL problem can be represented by a Markov decision process [21], which is defined as a 5-tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P is the state transition probability, R is a reward function, and γ is a discount factor. The process of RL is illustrated in the diagram in Fig 1.
An agent selects an action a_t ∈ A to interact with the environment, which transitions from state s_t to state s_{t+1} and returns a reward r_{t+1} to the agent. RL aims to determine the optimal policy that maximizes the cumulative reward in a given environment. For a given policy, the cumulative reward G_t is defined to quantify the value of the state:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{1}$$
where γ∈[0,1] is a variable that determines the way future rewards are valued. A larger γ means that future rewards are more important.
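As a small illustration of Eq (1), the discounted return can be computed by folding the reward sequence backward; the reward values below are hypothetical, chosen only to show the arithmetic.

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * r_{t+k+1} (Eq 1)."""
    g = 0.0
    # Accumulate from the last reward backward: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 arriving three steps ahead is discounted twice: 0.9^2 * 1 ≈ 0.81
print(discounted_return([0, 0, 1], gamma=0.9))
```

A larger γ weights the delayed reward more heavily; with γ = 1 the same sequence would return exactly 1.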
To evaluate the state and action, the state value function V(s) and state-action value function Q(s, a) are introduced as follows:
$$V(s) = \mathbb{E}\left[G_t \mid S_t = s\right] \tag{2}$$
$$Q(s,a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right] \tag{3}$$
Since both states and actions occur randomly with a certain probability, the above two equations are expressed by the expected values. Their Bellman equations are as follows:
$$V(s) = \mathbb{E}\left[r_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right] \tag{4}$$
$$Q(s,a) = \mathbb{E}\left[r_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right] \tag{5}$$
The optimal state value function and optimal state-action value functions can be respectively expressed as:
$$V^*(s) = \max_{\pi} V_{\pi}(s) \tag{6}$$
$$Q^*(s,a) = \max_{\pi} Q_{\pi}(s,a) \tag{7}$$
where π is the policy.
Q*(s_t, a_t) is the cumulative reward corresponding to the optimal action under the optimal policy. Therefore,
$$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t) \tag{8}$$
and the optimization problem of reinforcement learning can be expressed as
$$\pi^*(a_t \mid s_t) = \arg\max_{a_t} Q^*(s_t, a_t) \tag{9}$$
where π*(a_t|s_t) is the optimal policy for the RL agent.
2.2 Q-learning
Q-learning is an off-policy temporal-difference RL algorithm [15]: it follows a different policy when selecting an action than when updating the value function. To speed up the update of the state-action value function, Q-learning directly uses the maximum state-action value of the next state in the update, expressed as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{10}$$
where α∈(0,1) is the learning rate.
The Q-learning algorithm is given as follows:
- Set the discount factor γ, learning rate α, and reward r;
- Initialize matrix Q;
- For each episode:
  - Select an initial state s;
  - While the goal state has not been reached:
    - Select an action a from action set A based on a policy;
    - Perform a, reach the next state s′, and obtain reward r;
    - Update matrix Q;
    - Set s′ as s;
  - Until s is the goal state.
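The pseudocode above can be sketched as a runnable tabular Q-learning loop. The 4×4 grid, deterministic moves, and parameter values below are illustrative stand-ins, not the paper's 20×20 environment:

```python
import random

N = 4
GOAL = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    # Deterministic move, clamped at the grid border; reward 1 only at the goal
    nxt = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)

# Initialize matrix Q to zero (the traditional scheme criticized in Section 3.1)
Q = {((i, j), a): 0.0 for i in range(N) for j in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2
random.seed(0)

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        # Q-learning update (Eq 10): bootstrap from the best next action
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

After training, following the greedy policy (always the action with the largest Q-value) leads from the start to the goal without exploration.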
3. Improved Q-learning path planning algorithm
3.1 Initialization of Q-value function
In the traditional Q-learning algorithm, the Q-values are initialized to 0 or random numbers, which leads to blind action selection in the early stage and results in a large number of invalid iterations. To address this problem, we use prior knowledge to initialize the Q-values. This study considers the situation in which the initial and target positions of the mobile robot are known but the positions of obstacles are unknown; this known environmental information is used to initialize the Q-values.
In a grid map, each grid represents a state. The state value V(s’) of each state in the environment is obtained as
$$V(s') = \zeta \left( 1 - \frac{\rho(s')}{\rho_{\max}} \right) \tag{11}$$
where ζ is a positive coefficient, ρ(s′) is the distance from state s′ to the target position, and ρmax is the diagonal length of the grid map.
Then the Q-value is initialized according to the relationship between the state action value and state value as follows:
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \tag{12}$$
where P(s’|s,a) is the probability of transferring to state s’ when the current state s and action a are determined.
Initializing the Q-values in this way makes full use of the known environmental information, so that the closer a state is to the target, the larger its Q-value. In the early stage, the agent therefore moves toward the target with greater probability, decreasing the number of invalid iterations and accelerating convergence.
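A minimal sketch of this initialization (Eqs 11 and 12), assuming deterministic transitions (P(s′|s,a) = 1 for the reached cell) and a zero initial reward term; the grid size and target match Section 4, while the linear form of V(s′) is a reconstruction:

```python
import math

N = 20                 # 20x20 grid map as in Section 4.1
TARGET = (17, 11)      # known target position
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
zeta = 1.0             # positive coefficient from Section 4.2
gamma = 0.9
rho_max = math.hypot(N - 1, N - 1)  # diagonal length of the grid map

def state_value(s):
    # V(s') grows as the state gets closer to the target (Eq 11)
    rho = math.hypot(s[0] - TARGET[0], s[1] - TARGET[1])
    return zeta * (1 - rho / rho_max)

def init_q(s, a):
    # With deterministic moves the sum over s' in Eq 12 collapses to one term;
    # the reward term is taken as zero for initialization purposes
    s2 = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    return gamma * state_value(s2)

Q = {((i, j), a): init_q((i, j), a)
     for i in range(N) for j in range(N) for a in ACTIONS}
```

From the start cell (10, 10), the action moving toward the target already carries a larger initial Q-value than the action moving away, which is exactly the bias the initialization is meant to create.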
3.2 Dynamic adjustment of greedy factor ε
In traditional Q-learning, the ε-greedy policy is applied to select actions: the agent explores the environment with probability ε and exploits the current optimal action with probability 1−ε. Usually, a small ε can balance exploration and exploitation to a certain extent. However, when the environment is complex and the agent explores with only a small probability, it cannot be guaranteed to fully explore the environment in a limited time. To solve this problem, a dynamic adjustment strategy for the greedy factor ε is proposed, where ε is adjusted as
$$\varepsilon = \begin{cases} \varepsilon_1, & C < C_1 \\ \varepsilon_2, & C_1 \le C < C_2 \\ \varepsilon_3, & C \ge C_2 \end{cases} \tag{13}$$
where 0 < ε3 < ε2 < ε1 < 1, C is the number of times the agent successfully reaches the target position in the current episode, C1 and C2 are integers, and C1 < C2.
Due to the agent’s lack of environmental knowledge, early in the algorithm it has difficulty reaching the target location, and a large ε can accelerate its exploration of the environment. With the progress of the algorithm, the agent has a certain understanding of the environment, the number of times it reaches the target position increases, ε is decreased, and the exploitation of the optimal action is increased. When the algorithm reaches a certain stage and the number of times it reaches the target position exceeds a certain threshold, a small ε is set and the fast convergence of the algorithm is guaranteed.
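The staged schedule of Eq (13) is straightforward to implement; the sketch below uses the threshold and ε values given in Section 4.2, with the boundary conditions (strict vs. non-strict inequalities) chosen as one plausible reading:

```python
# Dynamic greedy-factor schedule (Eq 13): epsilon drops in stages as the
# count C of successful arrivals at the target grows.
EPS1, EPS2, EPS3 = 0.5, 0.2, 0.05  # values from Section 4.2
C1, C2 = 1, 50                     # thresholds from Section 4.2

def epsilon(c):
    if c < C1:
        return EPS1  # early stage: explore aggressively
    elif c < C2:
        return EPS2  # middle stage: balance exploration and exploitation
    return EPS3      # late stage: mostly exploit, ensuring fast convergence
```

In the training loop, `epsilon(C)` would simply replace the fixed ε used by the ε-greedy action selection.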
4. Experiments and analysis of results
4.1 Experimental design
To evaluate the performance of the improved algorithm, a 20×20 grid map is built based on the Python library Tkinter, as shown in Fig 2.
The size of each grid cell is 20 × 20 pixels. A square represents the agent, a triangle represents an obstacle, a white cell represents a barrier-free area, and the circle represents the target position. Each cell corresponds to one of 400 states. The starting point, i.e., state (1, 1), is set at position (10, 10), and the target is set at position (17, 11). The starting and target positions are known environmental information, while the obstacle positions are unknown. When the agent encounters an obstacle or reaches the target position, an episode ends; at the end of each episode, the agent is returned to the starting position to begin the next episode.
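The episode logic described above can be sketched as follows; the obstacle cells are hypothetical placeholders, since the paper treats the obstacle layout as unknown to the agent:

```python
START = (10, 10)                   # known starting position
TARGET = (17, 11)                  # known target position
OBSTACLES = {(12, 10), (14, 12)}   # hypothetical obstacle cells for illustration

def episode_done(s):
    # An episode ends when the agent hits an obstacle or reaches the target
    return s == TARGET or s in OBSTACLES

def reset():
    # The agent is returned to the start at the end of every episode
    return START
```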
4.2 Experimental parameter settings
The following four algorithms are compared in the simulation environment: Q-learning (original) represents the traditional Q-learning algorithm; Q-learning (Initialization) represents an improved algorithm that integrates prior knowledge to initialize the Q value; Q-learning (Dynamic) represents the use of greedy factor dynamic adjustment strategy instead of ε-greedy strategy to improve the algorithm; Q-learning (Improved) represents the final improved algorithm proposed in this paper.
In Table 1, √ indicates that the corresponding algorithm improvement is introduced into the traditional Q-learning algorithm, and × indicates that the corresponding algorithm is not introduced.
The parameters of the traditional Q-learning are set as follows: α = 0.01, γ = 0.9, ε = 0.2, the maximum number of episodes is 20000, and the reward function is
$$r = \begin{cases} r_{\text{target}}, & \text{the agent reaches the target position} \\ r_{\text{obstacle}}, & \text{the agent hits an obstacle} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

where r_target > 0 is the reward for reaching the target and r_obstacle < 0 is the penalty for hitting an obstacle.
The α, γ, ε, maximum number of episodes, and reward function are set to be the same in both the proposed improved Q-learning and the traditional Q-learning. Other parameters are ζ = 1, ε1 = 0.5, ε2 = 0.2, ε3 = 0.05, C1 = 1, C2 = 50.
4.3 Experimental results and analysis
The experimental results show that both algorithms can find the optimal path after a certain number of episodes. Fig 3 shows the optimal path of the proposed algorithm.
Figs 4–7 respectively show how the number of steps per episode changes for the four algorithms described above. When the number of steps fluctuates within a small range, the algorithm is considered to have converged. Traditional Q-learning converges in about 9500 episodes, with many steps per episode before that. The improved Q-learning algorithm converges in about 4500 episodes, with fewer steps per episode before convergence.
The convergence condition is that the standard deviation of the number of steps over 10 consecutive episodes is less than 5. Each algorithm is run 20 times and the results are averaged to obtain the data shown in Table 2.
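The convergence criterion stated above can be expressed as a small check over the recorded step counts; this is a sketch assuming the window slides over the most recent episodes:

```python
import statistics

def converged(steps_per_episode, window=10, threshold=5.0):
    """Converged once the sample standard deviation of the step counts
    over the last `window` episodes falls below `threshold`."""
    if len(steps_per_episode) < window:
        return False
    return statistics.stdev(steps_per_episode[-window:]) < threshold
```

For example, ten episodes that all take 30 steps satisfy the criterion (standard deviation 0), while step counts still swinging by tens of steps do not.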
Analysis of the data in Table 2 shows that, compared with the traditional Q-learning algorithm, the improved Q-learning algorithm reduces the running time before convergence by 53.1%, the total number of steps before convergence by 73.5%, and the number of episodes before convergence by 50.7%; these results are also better than those of the two intermediate algorithms. This demonstrates that the improved algorithm is effective and has practical significance for robot path planning.
5. Conclusion
We proposed an improved Q-learning path planning algorithm. The Q-value is initialized by combining prior knowledge, which eliminates a large number of invalid iterations at the early stage of the algorithm and makes the agent move toward the target with greater probability at the beginning. The greedy factor is dynamically adjusted based on the number of times the agent successfully reaches the target position, which better balances exploration and exploitation. Simulation results based on a grid map show that the improved algorithm is more efficient and faster than the traditional algorithm. However, many parameters in the improved algorithm must be manually set according to the environment. How to more efficiently set the parameters is the focus of our next work.
References
- 1. Viet H. H., An S. H., and Chung T. C., "Dyna-Q-based vector direction for path planning problem of autonomous mobile robots in unknown environments," vol. 27, no. 3, pp. 159–173, 2013.
- 2. Lou P, Xu K, Jiang X, et al. Path planning in an unknown environment based on deep reinforcement learning with prior knowledge[J]. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 2021(6):41.
- 3. Pan Z, Lei D, Wang L. A Knowledge-Based Two-Population Optimization Algorithm for Distributed Energy-Efficient Parallel Machines Scheduling[J]. IEEE Transactions on Cybernetics, 2020, PP(99).
- 4. Lv L., Zhang S., Ding D., and Wang Y., "Path Planning via an Improved DQN-based Learning Policy," IEEE Access, vol. PP, no. 99, pp. 1–1, 2019.
- 5. Hart P. E., Nilsson N. J., and Raphael B., "A Formal Basis for the Heuristic Determination of Minimum Cost Paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 28–29, 1972.
- 6. Wang J J, Wang L. A cooperative memetic algorithm with feedback for the energy-aware distributed flow-shops with flexible assembly scheduling[J]. Computers & Industrial Engineering, 2022, 168:108126–.
- 7. Wang Z. and Xiang X., "Improved Astar Algorithm for Path Planning of Marine Robot," 2018, pp. 5410–5414.
- 8. Sombolestan S M, Rasooli A, Khodaygan S. Optimal path-planning for mobile robots to find a hidden target in an unknown environment based on machine learning[J]. Journal of Ambient Intelligence and Humanized Computing, 2018.
- 9. Blasi L., D'Amato E., Mattei M., and Notaro I., "Path Planning and Real-Time Collision Avoidance Based on the Essential Visibility Graph," Applied Sciences, vol. 10, no. 16, p. 5613, 2020.
- 10. Zhao F, Shao D, Xu T, et al. An ensemble discrete water wave optimization algorithm for the blocking flow-shop scheduling problem with makespan criterion[J]. Applied Intelligence, 2022.
- 11. Cai C. and Ferrari S., "Information-driven sensor path planning by approximate cell decomposition," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 3, pp. 672–689, 2009.
- 12. Chao Y. and Wang H., "Developed Dijkstra shortest path search algorithm and simulation," in 2010 International Conference On Computer Design and Applications, 2010.
- 13. Khatib O., Real-Time Obstacle Avoidance for Manipulators and Mobile Robots. Springer New York, 1986.
- 14. Pan Y., Yang Y., and Li W., "A Deep Learning Trained by Genetic Algorithm to Improve the Efficiency of Path Planning for Data Collection With Multi-UAV," IEEE Access, vol. 9, pp. 7994–8005.
- 15. Sung I., Choi B., and Nielsen P., "On the training of a neural network for online path planning with offline path planning algorithms," International Journal of Information Management, vol. 57, p. 102142, 2020.
- 16. Liu Q., Shi L., Sun L., Li J., Ding M., and Shu F., "Path Planning for UAV-Mounted Mobile Edge Computing With Deep Reinforcement Learning," IEEE Transactions on Vehicular Technology, vol. 69, no. 5, pp. 5723–5728, 2020.
- 17. Zhao F, Zhao L, Wang L, et al. An Ensemble Discrete Differential Evolution for the Distributed Blocking Flowshop Scheduling with Minimizing Makespan Criterion[J]. Expert Systems with Applications, 2020, 160:113678.
- 18. Li S., Xin X., and Lei Z., "Dynamic path planning of a mobile robot with improved Q-learning algorithm," in 2015 IEEE International Conference on Information and Automation, 2015.
- 19. Zhao M., Lu H., Yang S., and Guo F., "The Experience-Memory Q-Learning Algorithm for Robot Path Planning in Unknown Environment," IEEE Access, vol. PP, no. 99, pp. 1–1, 2020.
- 20. Wen S., Chen J., Li Z., Rad A. B., and Othman K. M., "Fuzzy Q-learning obstacle avoidance algorithm of humanoid robot in unknown environment," in 2018 37th Chinese Control Conference (CCC), 2018.
- 21. Kober J., Bagnell J. A., and Peters J., "Reinforcement Learning in Robotics: A Survey," International Journal of Robotics Research, 2013.