Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation

To further improve the fuel economy of series hybrid electric tracked vehicles, a reinforcement learning (RL)-based real-time energy management strategy is developed in this paper. To effectively utilize the statistical characteristics of the online driving schedule, a recursive algorithm for the transition probability matrix (TPM) of the power-request is derived. Reinforcement learning is applied to calculate and update the control policy at regular intervals, adapting to varying driving conditions. A detailed facing-forward powertrain model is built, including the engine-generator model, battery model and vehicle dynamical model. The robustness and adaptability of the real-time energy management strategy are validated in simulation through comparison with a stationary control strategy based on the initial TPM generated from a long naturalistic driving cycle. Results indicate that the proposed method achieves better fuel economy than the stationary one and is more effective for real-time control.


Introduction
Hybrid electric vehicles (HEVs) are developing rapidly as a solution to fossil-fuel depletion and severe air pollution. Due to the cooperation of a battery pack and the internal combustion engine, the vehicle powertrain allows the engine to avoid operating at low load with poor efficiency, so fuel economy and emissions can be improved significantly. However, the flexibility of the power split also makes the energy management problem more challenging.
Energy management strategy (EMS) plays a crucial role in the trade-off among performance, fuel economy and emissions of HEVs. Numerous studies have been conducted on the energy management of HEVs [1]. Generally, energy management strategies of HEVs are classified into rule-based and optimization-based control strategies [2,3]. The rule-based strategy is widely used in practice due to its straightforward implementation and high computational efficiency. Jalil proposed a rule-based energy management strategy that determines the power split between the engine and battery by setting thresholds [4]. Trovão presented a new rule-based energy management strategy integrating meta-heuristic optimization for a multilevel EMS in an electric vehicle [5]. To further improve the performance of the energy management system, an adaptive fuzzy logic controller was used to calibrate the operating points and key parameters to minimize the fuel consumption according to the driving cycles [6]. However, the performance of any rule-based strategy is highly dependent on the proper design of the control rules, which usually relies on engineering experience. Therefore, many researchers have devoted more effort to optimization-based energy management strategies.
With a priori knowledge of the driving cycle, dynamic programming (DP) obtains the optimal result and determines the best achievable fuel economy. However, the real-time capability and robustness of this strategy cannot be guaranteed [7]. Instead, DP is implemented offline and serves as a benchmark to explore the potential of fuel economy [8]. To make online optimization possible, the equivalent consumption minimization strategy (ECMS) and model predictive control (MPC) have been adopted for energy management [9,10]. ECMS is based on the assumption that the variation of SOC (state of charge of the battery) is negligible due to its slow dynamics compared to the other dynamics in an HEV [11]. The equivalence factor of ECMS has an important effect on the control performance; however, its optimal value must be determined offline according to a specific cycle [12]. MPC is a promising method for dynamic models due to its prediction ability over a finite future time horizon. An MPC-based strategy was developed by predicting the road slope; the results show that the method not only maintains the battery SOC within its boundary, but also achieves better fuel economy [13]. Pontryagin's minimum principle (PMP) has been used to find the optimal energy management strategy by combining power prediction with traffic information such as the maximum acceleration, average velocity and maximum velocity [14]. However, the performance of MPC depends heavily on the prediction accuracy, and varying weather conditions and driving styles make it difficult to guarantee that accuracy.
The existing methods mostly considered a single optimization objective, such as fuel consumption, while disregarding many other concerns. The convex multicriteria optimization approach was recently harnessed to optimize multiple objectives of plug-in hybrid electric vehicles (PHEVs), including battery sizing, charging and on-road power management [15]. Ref. [16] studied the optimal trade-off between fuel-cell durability and hydrogen economy for a fuel-cell hybrid bus. Ref. [17] investigates the interactions among three control tasks, namely charging, on-road power management and battery degradation mitigation in a smart grid environment, aiming to minimize the daily operational expense of a PHEV. A high-efficiency convex programming framework is harnessed to minimize daily CO2 emissions through integrating renewable energy and system-level hybrid powertrain optimization [18].
Several novel algorithms have been developed to realize online optimization for multiple types of HEVs, such as game theory [19], stochastic dynamic programming (SDP) [20], and reinforcement learning (RL) [21]. Wang optimized the power management of a hybrid electric vehicle with an SDP algorithm; however, the heavy computation burden makes it difficult to implement online [22]. RL is a heuristic learning method applied in numerous areas. Ref. [23] applied RL to the energy management strategy of a hybrid electric tracked vehicle and compared the performance of RL and SDP. The results indicate that the RL algorithm has better performance and a shorter computation time. However, that RL strategy does not update online and fails to maintain optimal performance when the driving cycles vary.
This paper proposes a real-time energy management strategy for a small series hybrid electric tracked vehicle. A recursive algorithm is developed to compute the transition probability matrix (TPM) of the power-request as new statistical characteristics of the online driving cycle are considered. An RL algorithm is applied to obtain the optimal control policy based on the updated TPM of the power-request at regular intervals. A facing-forward simulation model is established to evaluate the RL-based real-time energy management strategy through comparison with the RL-stationary control strategy. Simulation results show that the proposed method achieves better fuel economy than the RL-stationary one and is feasible for real-time control.

Modelling of hybrid electric tracked vehicle

Vehicle configuration and parameters
The structure of the hybrid electric tracked vehicle is illustrated in Fig 1. An engine-generator-rectifier set (EGS) and a battery pack supply electricity to dual motors, which propel the sprockets independently. The engine delivers a maximum power of 50 kW and a maximum torque of 93 Nm within the speed range from 1200 r/min to 6200 r/min. The generator offers a maximum torque of 107 Nm and a maximum power of 40 kW within the speed range from 0 r/min to 6400 r/min. The 37.6 Ah lithium iron phosphate battery pack has a rated voltage of 307 V. The essential parameters of the major sub-systems are listed in Table 1.

Vehicle and powertrain modelling
The vehicle is treated as a concentrated mass. The dynamical equation of the vehicle is expressed as

$$F_{TR} = ma + mgf\cos\theta + mg\sin\theta + \frac{1}{2}\rho C_D A v_{ave}^2 \qquad (1)$$

where $F_{TR}$ is the tractive force; $m$ is the curb weight and $a$ is the vehicle acceleration; $g$ is the gravitational acceleration; $\theta$ is the road slope angle; $f$ is the rolling resistance coefficient; $\rho$ is the air density; $C_D$ is the aerodynamic drag coefficient; $A$ is the frontal area of the vehicle; $v_{ave}$ is the average speed of the two tracks, determined by $v_{ave} = (v_1 + v_2)/2$, where $v_1$ and $v_2$ are the speeds of the two tracks, respectively. The demand power to propel the vehicle, denoted by $P_{dem}$, consists of two parts, the straight-line power and the steering power:

$$P_{dem} = F_{TR}\,v_{ave} + M\omega, \qquad M = \frac{u_t m g L}{4}, \qquad u_t = \frac{u_{max}}{0.925 + 0.15\,R/B} \qquad (2)$$

where $M$ is the resisting yaw moment; $\omega$ is the yaw rotational speed; $u_t$ is the lateral resistance coefficient; $R$ is the turning radius; $L$ is the length of track in contact with the ground, and $B$ is the tread of the vehicle; $u_{max}$ is the maximum steering resistance coefficient, reached at the radius of braking steering $R = B/2$. To analyze and evaluate the fuel economy of the EGS, the equivalent electric circuit, consisting of the diesel engine, permanent magnet generator and rectifier, is established in Fig 2. The output voltage $U_g$ and electromagnetic torque $T_g$ of the generator are determined as [24]

$$U_g = K_e\omega_g - K_x I_g, \qquad T_g = K_e I_g \qquad (3)$$

where $\omega_g$ is the angular velocity; $K_e\omega_g$ is the electromotive force; $K_x$ is the equivalent resistance coefficient, $K_x = 3PL_g/\pi$; $L_g$ is the synchronous inductance of the armature, and $P$ is the number of poles; $I_g$ and $U_g$ are the current and output voltage of the generator.
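As an illustration of the power-request calculation above, the following Python sketch combines the straight-line tractive force with the steering power. The air density constant, the empirical relation for the lateral resistance coefficient, and all parameter values used below are assumptions for illustration, not data from the paper.

```python
import math

RHO_AIR = 1.225   # air density in kg/m^3 (assumed; not given in the text)

def power_request(m, g, f, C_D, A, v1, v2, a, theta, u_max, L, B):
    """Demand power P_dem = straight-line power + steering power (Eqs. 1-2)."""
    v_ave = 0.5 * (v1 + v2)                       # average track speed
    # Straight-line tractive force: inertia + rolling + grade + aerodynamic drag
    F_TR = (m * a + m * g * f * math.cos(theta)
            + m * g * math.sin(theta)
            + 0.5 * RHO_AIR * C_D * A * v_ave ** 2)
    if abs(v2 - v1) < 1e-9:
        return F_TR * v_ave                       # straight driving: no steering term
    # Kinematics of skid steering from the two track speeds
    R = B / 2.0 * (v1 + v2) / abs(v2 - v1)        # turning radius
    omega = abs(v2 - v1) / B                      # yaw rate
    u_t = u_max / (0.925 + 0.15 * R / B)          # lateral resistance coefficient
    M = u_t * m * g * L / 4.0                     # resisting yaw moment
    return F_TR * v_ave + M * omega
```

During a turn the yaw-moment term adds to the straight-line power, which is why steering segments of a tracked-vehicle cycle are markedly more power-hungry than straight driving at the same average speed.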
The fuel consumption of the vehicle is determined by the engine torque $T_{eng}$ and speed $n_{eng}$. Since the engine and generator are coaxially connected, $n_{eng} = n_g$, and the rotational dynamics of the engine-generator set are

$$(J_{eng} + J_g)\frac{d\omega_g}{dt} = T_{eng} - T_g \qquad (4)$$

where $n_g$ and $n_{eng}$ denote the rotational speeds of the generator and engine, respectively; $J_{eng}$ and $J_g$ are the moments of inertia of the engine and generator; $T_{eng}$ is the engine torque and $T_g$ is the electromagnetic torque of the generator.
An internal-resistance model is used to describe the SOC dynamics as [25]

$$\frac{dSOC}{dt} = -\frac{I_b}{C_b} \qquad (5)$$

where $C_b$ is the battery capacity and $I_b$ is the battery current, calculated by

$$I_b = \frac{V_{OC} - \sqrt{V_{OC}^2 - 4R_{int}P_b}}{2R_{int}} \qquad (6)$$

where $V_{OC}$ is the open-circuit voltage, $R_{int}$ is the internal resistance of the battery, and $P_b$ is the output power of the battery.
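The SOC dynamics above can be sketched as a short Python function. The constant open-circuit voltage and internal resistance used here are illustrative placeholders; in practice both would be looked up from SOC-dependent maps.

```python
import math

def battery_step(soc, p_b, dt, v_oc=307.0, r_int=0.1, c_b=37.6 * 3600):
    """Advance SOC by dt seconds under battery output power p_b (W),
    using the internal-resistance model (Eqs. 5-6):

        I_b = (V_oc - sqrt(V_oc^2 - 4*R_int*P_b)) / (2*R_int)
        dSOC/dt = -I_b / C_b        (C_b in ampere-seconds)
    """
    disc = v_oc ** 2 - 4.0 * r_int * p_b
    if disc < 0:
        # The quadratic has no real root: the requested power is infeasible
        raise ValueError("requested power exceeds battery capability")
    i_b = (v_oc - math.sqrt(disc)) / (2.0 * r_int)  # positive when discharging
    return soc - i_b * dt / c_b
```

A positive `p_b` (discharge) yields a positive current and decreases the SOC; a negative `p_b` (charging through regenerative braking or the EGS) increases it.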

Optimal energy management problem formulation
The cost function is a trade-off between the fuel consumption and the SOC variation, ensuring that the final SOC of the battery stays at the same level as the initial value:

$$J = \int_0^T \left[\dot{m}_{fuel}(t) + \beta\left(SOC(t) - SOC(0)\right)^2\right]dt \qquad (7)$$

where $\beta$ is the penalty factor, a positive weighting factor, and $\dot{m}_{fuel}(t)$ is the fuel mass flow rate at time $t$, determined by the engine torque $T_{eng}$ and rotational speed $n_{eng}$ from the BSFC (brake-specific fuel consumption) map, typically obtained through a bench test. The engine torque is used to regulate the power split between the EGS and the battery to minimize the total fuel consumption.
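A minimal sketch of the instantaneous stage cost follows; the penalty factor and reference SOC values are illustrative assumptions (the paper tunes the penalty so that the final SOC returns to its initial level).

```python
def stage_cost(fuel_rate, soc, soc_ref=0.75, beta=1000.0):
    """Instantaneous cost: fuel mass flow rate plus a quadratic penalty on
    SOC deviation from the reference (here taken as the initial SOC).
    beta (assumed value) trades fuel economy against charge sustenance.
    """
    return fuel_rate + beta * (soc - soc_ref) ** 2
```

With a larger `beta` the optimizer holds the SOC closer to its initial value at the cost of fuel; with a smaller one it exploits the battery more freely within its bounds.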

Online updating transition probability matrix (TPM)
The power-request of the vehicle is calculated according to Eq (2). Using maximum likelihood estimation and the nearest-neighbor method, the transition probability of the power-request is estimated as

$$p_{ij}(k) = \frac{N_{ij}(k)}{N_i(k)} = \frac{F_{ij}(k)}{F_i(k)} \qquad (8)$$

where $N_{ij}$ denotes the number of transitions from state $x_i$ to state $x_j$, and $N_i$ is the total number of transitions initiated from state $x_i$; $k$ is the number of transitions; $F_{ij}$ is the frequency rate of the transition event $f_{ij}$ from $x_i$ to $x_j$, and $F_i$ is the frequency rate of the event $f_i$ that a transition starts from $x_i$. Then $\sum_{t=1}^{k} f_{ij}(t) = N_{ij}(k)$ and $\sum_{t=1}^{k} f_i(t) = N_i(k)$. The frequency rates $F_{ij}$ and $F_i$ are calculated as

$$F_{ij}(k) = \frac{1}{k}\sum_{t=1}^{k} f_{ij}(t), \qquad F_i(k) = \frac{1}{k}\sum_{t=1}^{k} f_i(t) \qquad (9)$$

and the recursive expressions are deduced [26]:

$$F_{ij}(k) = F_{ij}(k-1) + \frac{1}{k}\left[f_{ij}(k) - F_{ij}(k-1)\right], \qquad F_i(k) = F_i(k-1) + \frac{1}{k}\left[f_i(k) - F_i(k-1)\right] \qquad (10)$$

where $1/k$ is replaced by a constant $\psi$ ranging from 0 to 1, namely the forgetting factor, which determines the effective memory depth of the historic driving cycle; $\psi$ is set to 0.01 in this paper. Substituting expression (10) into (8), the recursive algorithm of the TPM for online learning is derived as

$$p_{ij}(k) = \frac{(1-\psi)F_{ij}(k-1) + \psi f_{ij}(k)}{(1-\psi)F_i(k-1) + \psi f_i(k)} \qquad (11)$$
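The recursive TPM update can be sketched in Python using the exponential-forgetting form implied by replacing 1/k with the constant ψ; the uniform initialization of the frequency rates is an assumption for illustration.

```python
import numpy as np

class OnlineTPM:
    """Online estimate of the power-request transition probability matrix
    with a forgetting factor psi (Eq. 11 in the text)."""

    def __init__(self, n_states, psi=0.01):
        self.psi = psi
        # Frequency rates of transition events; uniform start is an assumption
        self.F_ij = np.full((n_states, n_states), 1.0 / n_states ** 2)
        self.F_i = self.F_ij.sum(axis=1)

    def update(self, i, j):
        """Observe one transition from state i to state j:
        F(k) = (1 - psi) * F(k-1) + psi * f(k)."""
        f_ij = np.zeros_like(self.F_ij)
        f_ij[i, j] = 1.0
        self.F_ij = (1.0 - self.psi) * self.F_ij + self.psi * f_ij
        self.F_i = self.F_ij.sum(axis=1)

    @property
    def tpm(self):
        # p_ij = F_ij / F_i, so every row sums to one
        return self.F_ij / self.F_i[:, None]
```

With ψ = 0.01 the effective memory is on the order of a few hundred transitions, so the matrix tracks the most recent driving conditions while older statistics decay away.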

RL-based real-time energy management strategy
The driving schedule is considered as a finite Markov decision process, which comprises a set of state variables $s_t \in S = \{SOC(t), n_g(t) \mid 0.6 \le SOC(t) \le 0.9,\; 0 \le n_g(t) \le 6400\}$, a set of actions $a_t \in A = \{T_{eng}(t) \mid 0 \le T_{eng}(t) \le 93\}$, and a reward function $r_t \in \mathrm{Reward} = \{fuel(s_t, a_t)\}$.
Corresponding to state $s$ and action $a$, the optimal value of the state $s_t$ is defined as the expected finite discounted sum of the rewards, as follows [27]:

$$V^*(s_t) = \min_{a_t}\left[r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1}\mid s_t, a_t)\, V^*(s_{t+1})\right] \qquad (12)$$

where $\pi$ is a control policy, $\gamma \in [0,1]$ is a discount factor, and $p$ is the probability of the occurrence of a transition from state $s_t$ to $s_{t+1}$ under action $a_t$. The optimal control policy is then decided by

$$\pi^*(s_t) = \arg\min_{a_t}\left[r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1}\mid s_t, a_t)\, V^*(s_{t+1})\right] \qquad (13)$$

In addition, the Q value $Q(s_t, a_t)$ and the optimal value $Q^*$ corresponding to state $s_t$ and action $a_t$ are defined as

$$Q^*(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1}\mid s_t, a_t)\, \min_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \qquad (14)$$

Finally, the updating rule of Q-learning and the optimal control strategy are defined as [28,29]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \min_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right], \qquad \pi^*(s_t) = \arg\min_{a_t} Q(s_t, a_t) \qquad (15)$$

where $\gamma \in [0,1]$ and $\alpha \in [0,1]$ are the discount factor and the decaying learning rate in Q-learning, respectively. The computational flowchart of RL is given in pseudocode format in Fig 3. Using the Q-learning algorithm above, the stationary control policy is derived from the initial TPM generated from a long naturalistic driving cycle, as shown in Fig 4. This paper proposes a real-time energy management strategy aiming to improve the adaptability to changes of the power-request characteristics. First, the updated TPM of the power-request is acquired in real time according to Eq (11). Then, at set intervals, the control policy is recalculated by the Q-learning algorithm based on the updated TPM and applied to the controller.
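A tabular Q-learning step consistent with the update rule above, minimizing cost (fuel) rather than maximizing reward, might look like the following sketch; the discretized state and action indices are an assumed encoding.

```python
import numpy as np

def q_learning_step(Q, s, a, cost, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update (Eq. 15, cost-minimizing form):
    Q(s,a) <- Q(s,a) + alpha*[cost + gamma*min_a' Q(s',a') - Q(s,a)]

    Q is an (n_states, n_actions) array; s, a, s_next are indices;
    cost plays the role of the reward r_t = fuel(s_t, a_t)."""
    td_target = cost + gamma * Q[s_next].min()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def greedy_policy(Q):
    """pi*(s) = argmin_a Q(s,a), the control map once learning has run."""
    return Q.argmin(axis=1)
```

In the paper's setting, the transition `s_next` would be sampled from the updated TPM and `cost` read from the BSFC map, and the resulting `greedy_policy(Q)` is the throttle/torque control map loaded into the controller.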

Simulation and validation
The simplified electric coupling model of the powertrain for the hybrid electric tracked vehicle is given in Fig 6, where the power provided by the EGS and battery must balance the power-request of the two driving motors. Based on the electric coupling model and the powertrain model in Section 2, a detailed facing-forward model of the hybrid electric tracked vehicle is established in the Simulink environment, consisting of the engine-generator model, power battery pack, two motors, vehicle dynamic model and controller, as shown in Fig 7. A proportional-integral driver model is adopted to adjust the torques of both motors to follow the target driving cycles. The controller determines the throttle according to the control map obtained through the RL algorithm as the vehicle states are fed back to the controller. The RL-based real-time energy management strategy and the stationary energy management strategy are each applied to the facing-forward simulation model for the same driving cycle. The control map remains unchanged in the case of the stationary strategy, while it is updated every 300 s in the case of the real-time strategy. The initial values of the state variables n_g and SOC are taken as 1200 rpm and 0.75, respectively. The final SOC values stay close to the initial SOC value due to the final constraints on the SOC. During the first 300 seconds, the two control strategies are based on the same TPM, so the SOC trajectories change in almost the same way. However, when the driving condition changes, the control policy is recalculated at 300 s for the real-time control strategy, and the same process is triggered at 600 s and 900 s. The power split changes correspondingly, as shown in Fig 10. To eliminate the influence of the deviation between the final SOC values on the fuel consumption, an SOC-correction method [30] is utilized to compensate for the fuel consumption. Table 3 shows the fuel consumption of the two methods after SOC correction.
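The outer loop of the real-time strategy described above (refresh the TPM every step, recompute the policy every 300 s) can be sketched as follows; the callback decomposition is an assumed structure for illustration, not the paper's implementation.

```python
def real_time_ems(observe, update_tpm, recompute_policy, apply_action,
                  dt=1.0, horizon=1200.0, update_period=300.0):
    """Outer loop of the RL-based real-time strategy (assumed structure).

    observe(t)          -> (vehicle state, power-request transition) at time t
    update_tpm(tr)      -> recursive TPM update, Eq. (11)
    recompute_policy()  -> Q-learning on the latest TPM, returns a control map
    apply_action(p, s)  -> apply the control map p to the current state s
    """
    t, next_update = 0.0, update_period
    policy = recompute_policy()          # initial (stationary) policy
    while t < horizon:
        state, transition = observe(t)   # e.g. SOC, n_g, power-request step
        update_tpm(transition)           # learn online at every time step
        if t >= next_update:             # every 300 s: refresh the control map
            policy = recompute_policy()
            next_update += update_period
        apply_action(policy, state)
        t += dt
    return policy
```

Over a 1200 s cycle this triggers the initial policy computation plus recalculations at 300 s, 600 s and 900 s, matching the schedule used in the simulation.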
The fuel consumption of the RL-based real-time strategy is 6% lower than that of the stationary strategy. Fig 11 depicts the working points of the engine for the experimental driving schedule. There are more points in the optimal fuel consumption area under the real-time strategy than under the stationary one.
To further explore the adaptability and robustness of the RL-based real-time strategy, the same procedure is performed for the validation driving schedule from the field test (shown in Fig 12). The SOC trajectories and working points of the engine are shown in Fig 13 and Fig 14. Because of the constraint on the SOC value, the final SOC values are close to the initial SOC value. Table 4 lists the fuel consumption after SOC correction. By utilizing the newest characteristics of the driving schedule, the RL-based real-time control strategy reduces fuel consumption by about 8% compared with the stationary one. The fuel improvement is larger in this simulation than in the former one because the driving conditions over the whole validation driving schedule are similar, as presented in Fig 12, so the control policy based on the TPM updated at 600 s already includes the statistical information of the whole driving schedule. The results suggest that the real-time control strategy is robust and better able to adapt to different driving cycles.
Three additional driving schedules were adopted to validate the robustness of the RL-based real-time energy management strategy again, as shown in Fig 15. Table 5 lists the fuel consumption after SOC correction. The real-time method, which utilizes the newest driving characteristics, gives superior fuel economy to the stationary one.

Conclusion
This paper proposes a real-time energy management strategy based on reinforcement learning for a hybrid electric tracked vehicle. A recursive algorithm for the transition probability matrix (TPM) is developed to make use of the new statistical characteristics of the online driving schedule. Based on the updated TPM, the control policy is calculated and updated at regular intervals to adapt to different driving conditions. A detailed facing-forward simulation model including the engine-generator model, battery model and vehicle dynamical model is built. To validate the effectiveness and adaptability of the real-time control policy, simulations for two driving schedules are performed. The results indicate that the real-time energy management strategy achieves better fuel economy than the stationary one and is more effective for real-time control.