Reinforcement learning approach to control an inverted pendulum: A general framework for educational purposes

doi:10.1371/journal.pone.0280071

Fig 1.

Sketch of the inverted pendulum.

Fig 2.

RL learning process.

The state of the environment s is measured and given to the agent. The agent updates its policy and accordingly chooses the action a for the next step among −U, 0, or U. After one sampling time, the state s evolves and the cycle continues. In our case, . The policy is updated by updating the action-value function, using a so-called Q-table for Q-Learning and an artificial neural network for DQN.
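The cycle described in this caption can be illustrated with a minimal tabular Q-learning sketch. It is not the authors' implementation: the names `make_q_table`, `choose_action`, and `q_update`, the epsilon-greedy exploration rule, and the values of `alpha` and `gamma` are assumptions made for illustration. The state is assumed to be already discretized into a tuple of bin indices, and the three actions stand for −U, 0, and U.

```python
import random
from collections import defaultdict

# Action multipliers of the voltage amplitude U: -U, 0, +U (as in the caption).
ACTIONS = (-1, 0, 1)


def make_q_table():
    """Q-table: discretized state (tuple of bin indices) -> one value per action."""
    return defaultdict(lambda: [0.0] * len(ACTIONS))


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy choice (assumed here): explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = q_table[state]
    return values.index(max(values))


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the target r + gamma * max over a' of Q(s', a')."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
```

In one pass of the loop, the agent would call `choose_action` on the measured state, apply `ACTIONS[action] * U` for one sampling time, observe the reward and the new state, and then call `q_update`. DQN follows the same cycle but replaces the table lookup with a neural-network approximation of the action-value function.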

Fig 3.

The experimental setup.

Fig 4.

Learning results using the basic tabular Q-learning implementation.

Left: Normalized return as a function of the number of time steps for different total numbers of episodes N_T. Right: Temporal evolution of cos θ in the best episode of the longest learning process (N_T = 10⁷). The observation space is discretized homogeneously into different numbers of bins: a) and b) nBins = (10, 10, 10, 10, 10); c) and d) nBins = (50, 50, 50, 10, 10).
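To make the homogeneous discretization concrete, here is a small sketch of uniform binning over a five-component observation. The function names and the bounds `lows` and `highs` are hypothetical, as the caption does not give the range of each component. Only the two `nBins` tuples and the three-action set come from the captions.

```python
def discretize(observation, lows, highs, n_bins):
    """Map each continuous component to a bin index in [0, n_bins[i] - 1]."""
    indices = []
    for value, low, high, bins in zip(observation, lows, highs, n_bins):
        fraction = (value - low) / (high - low)
        index = int(fraction * bins)
        indices.append(min(max(index, 0), bins - 1))  # clip out-of-range values
    return tuple(indices)


def table_size(n_bins, n_actions=3):
    """Number of Q-table entries: product of the bin counts times the actions."""
    size = n_actions
    for bins in n_bins:
        size *= bins
    return size


N_BINS_COARSE = (10, 10, 10, 10, 10)  # panels a) and b)
N_BINS_FINE = (50, 50, 50, 10, 10)    # panels c) and d)
```

With three actions, the coarse grid gives `table_size(N_BINS_COARSE)` = 300,000 entries and the fine grid `table_size(N_BINS_FINE)` = 37,500,000. The finer grid therefore has 125 times as many state-action pairs to visit during learning.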

Fig 5.

Experimental results with the best policies in inference for two different applied voltages, U = 2.4 V (blue) and 7.1 V (green): a) temporal evolution of the cart’s position x during one episode; b) trajectory of the cart in space; c) temporal evolution of the pendulum’s angle θ during one episode; d) trajectory of the pendulum in space; e) temporal evolution of the applied voltage during the first 200 time steps.

Fig 6.

Influence of the applied voltage on the learning process.

Thin curves correspond to different simulations, while thick curves refer to the experimental observations. a) Learning curve. b) Inference curve built from inferences performed every 5000 time steps. c) Temporal evolution of the reward for the episode initiated at θ₀ = 0 following the best learned policy. d) Statistics of the plateau reward over 10 episodes initiated with θ₀ between −10° and 10°, following the best learned policy.

Fig 7.

Influence of the physical parameters on the control: inference curves for a) static friction, b) viscous friction, c) measurement noise, and d) action noise.
