Fig 1.
Sketch of the inverted pendulum.
Fig 2.
The state s of the environment is measured and passed to the agent. The agent updates its policy and accordingly chooses the action a for the next step from −U, 0, or U. After one sampling time, the state s evolves and the cycle continues. In our case, . The policy is updated through the action-value function, which is represented by a so-called Q-table for Q-learning and by an artificial neural network for DQN.
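The loop described in this caption can be illustrated with a minimal sketch of one tabular Q-learning step. This is not the authors' implementation: the function names, the learning rate `alpha`, the discount factor `gamma`, and the epsilon-greedy exploration rule are all assumptions chosen for illustration. The state is assumed to be already discretized into a tuple of bin indices, and the three action indices map to the voltages −U, 0, and U.

```python
import numpy as np

def action_to_voltage(action, U):
    """Map an action index (0, 1, 2) to the applied voltage (-U, 0, U)."""
    return (-U, 0.0, U)[action]

def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy choice (an assumed exploration rule, not from the source)."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))  # explore: random action
    return int(np.argmax(q_table[state]))            # exploit: greedy action

def q_learning_step(q_table, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99):
    """Apply one standard Q-learning update to q_table in place.

    state and next_state are tuples of bin indices; the last axis of
    q_table indexes the actions. alpha and gamma are hypothetical values.
    """
    td_target = reward + gamma * np.max(q_table[next_state])
    idx = state + (action,)
    q_table[idx] += alpha * (td_target - q_table[idx])
    return q_table
```

In a DQN variant, the table lookup `q_table[state]` would be replaced by a forward pass through a neural network, while the structure of the measure, act, and update cycle stays the same.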
Fig 3.
The experimental setup.
Fig 4.
Learning results using the basic tabular Q-learning implementation.
Left: Normalized return as a function of the number of time steps for different total numbers of episodes NT. Right: Temporal evolution of cos θ in the best episode of the longest learning process (NT = 107). The observation space is discretized homogeneously into different numbers of bins: a) and b) nBins = (10, 10, 10, 10, 10); c) and d) nBins = (50, 50, 50, 10, 10).
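The homogeneous discretization mentioned in this caption can be sketched as follows. This is an illustrative assumption of how such binning might be done, not the authors' code: the helper names and the observation bounds `low` and `high` are hypothetical, and only the bin counts (such as nBins = (10, 10, 10, 10, 10)) come from the caption.

```python
import numpy as np

def make_bin_edges(low, high, n_bins):
    """Interior edges of equally wide bins for each observation dimension.

    low and high are assumed bounds; n bins need n - 1 interior edges.
    """
    return [np.linspace(lo, hi, n + 1)[1:-1]
            for lo, hi, n in zip(low, high, n_bins)]

def discretize(obs, edges):
    """Map a continuous observation to a tuple of bin indices.

    Values outside the assumed bounds fall into the first or last bin.
    """
    return tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))
```

The resulting tuple can index a Q-table of shape `n_bins + (3,)`, so the table for nBins = (50, 50, 50, 10, 10) has 1250 times as many entries as the one for (10, 10, 10, 10, 10) with the same three actions.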
Fig 5.
Experimental results with the best policies in inference for two different applied voltages, U = 2.4 V (blue) and 7.1 V (green): a) Temporal evolution of the cart’s position x during one episode. b) Trajectory of the cart in the space. c) Temporal evolution of the pendulum’s angle θ during one episode. d) Trajectory of the pendulum in the space. e) Temporal evolution of the applied voltage during the first 200 time steps.
Fig 6.
Influence of the applied voltage on the learning process.
Thin curves correspond to different simulations, while thick curves refer to the experimental observations. a) Learning curve. b) Inference curve built from inferences performed every 5000 time steps. c) Temporal evolution of the reward for the episode initiated at θ0 = 0 following the best learned policy. d) Statistics of the plateau reward over 10 episodes initiated with θ0 between −10° and 10°, following the best learned policy.
Fig 7.
Influence of the physical parameters on the control: inference curves for varying a) static friction, b) viscous friction, c) measurement noise, and d) action noise.