How to choose the reward function for the cart-pole inverted pendulum task


I am new to Python (or any programming language, for that matter). For months now I have been working on stabilising the inverted pendulum. I have everything working, but I am struggling to find the right reward function. So far, after research and trial and error, the best I could come up with is

R = (x_dot**2) + 0.001*(x**2) + 0.1*(theta**2)

But I don't reach stability, i.e. keeping theta = 0 for long enough.

Does anyone have an idea of the logic behind the ideal reward function?
Thank you.

There are 2 answers below.

I am working on the inverted pendulum too. I found the following reward function, which I am trying:

costs = angle_normalise(th)**2 + 0.1*thdot**2 + 0.001*(action**2)
# angle_normalise wraps the angle th into [-pi, pi]: ((th + pi) % (2*pi)) - pi
reward = -costs

but I still have a problem choosing the actions; maybe we can discuss it.


For just the balancing problem (not the swing-up), even a binary reward is enough. Something like

  • Always 0, then -1 when the pole falls. Or,
  • Always 1, then 0 when the pole falls.

Which one to use depends on the algorithm used, the discount factor, and the episode horizon. Either way, the task is easy and both will do the job.
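
A minimal sketch of those two variants, assuming the pole counts as "fallen" once |theta| exceeds some threshold or the cart leaves the track (the function names and thresholds below are illustrative, roughly matching Gym's CartPole defaults):

def balance_reward_v1(theta, x, theta_limit=0.21, x_limit=2.4):
    # 0 on every step, -1 once the pole falls or the cart leaves the track
    fallen = abs(theta) > theta_limit or abs(x) > x_limit
    return -1.0 if fallen else 0.0

def balance_reward_v2(theta, x, theta_limit=0.21, x_limit=2.4):
    # 1 on every step while balanced, 0 once the pole has fallen
    fallen = abs(theta) > theta_limit or abs(x) > x_limit
    return 0.0 if fallen else 1.0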

For the swing-up task (harder than just balancing, as the pole starts upside down and you need to swing it up by moving the cart), it is better to have a reward that depends on the state. Usually the simple cos(theta) is fine. You can also add a penalty on the angular velocity and on the action, in order to prefer slow, smooth trajectories, and a penalty if the cart goes out of the boundaries of the x coordinate.
A reward including all these terms would look like this:
reward = cos(theta) - 0.001*theta_d**2 - 0.0001*action**2 - 100*out_of_bound(x)
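
A runnable sketch of that reward, assuming out_of_bound is a simple indicator on the cart position (the function names and the track limit here are illustrative, not part of the original answer):

import numpy as np

def out_of_bound(x, x_limit=2.4):
    # 1.0 if the cart has left the allowed track, 0.0 otherwise
    return 1.0 if abs(x) > x_limit else 0.0

def swingup_reward(theta, theta_d, action, x):
    # cos(theta) is +1 with the pole upright and -1 when it hangs down;
    # the remaining terms penalise fast rotation, large actions and leaving the track
    return np.cos(theta) - 0.001*theta_d**2 - 0.0001*action**2 - 100.0*out_of_bound(x)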