I'm studying Reinforcement Learning and I'm having trouble understanding the difference between SARSA, Q-Learning, Expected SARSA, Double Q-Learning, and temporal difference learning. Can you please explain the differences and tell me when to use each? And how does each one behave with ε-greedy versus greedy action selection?
SARSA:
I'm in state $S_t$, and an action $A_t$ is chosen with the help of the policy, which moves me to another state $S_{t+1}$. Depending on the policy, an action $A_{t+1}$ is then chosen in $S_{t+1}$, and my value estimate $Q(S_t, A_t)$ is updated using the observed reward and the estimated value of the look-ahead pair $(S_{t+1}, A_{t+1})$:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
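To check my understanding, here's a minimal sketch of one SARSA step in Python; the tabular `Q[state, action]` array and the `epsilon_greedy` helper are just my own assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(Q, state, eps, rng):
    """With probability eps pick a random action, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target bootstraps from the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The way I understand it, the same ε-greedy policy both generates $A_{t+1}$ and appears in the target, which is what makes SARSA on-policy.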
Q-Learning:
I'm in state $S_t$, and an action $A_t$ is chosen with the help of the policy, which moves me to state $S_{t+1}$. This time the update doesn't depend on the action the policy actually takes next; instead it observes the maximum estimated value (the greedy value) over all actions in $S_{t+1}$, and through it the value estimate $Q(S_t, A_t)$ is updated:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
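Again a minimal sketch of how I picture this step, under the same tabular `Q[state, action]` assumption:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the greedy (max) value in s_next,
    no matter which action the behaviour policy will actually take there."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```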
Expected SARSA:
It's going to be the same as Q-Learning, except that instead of updating my estimate with the greedy value in $S_{t+1}$, I take the expected value over all actions, weighted by their probabilities under the policy:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, \mathbb{E}\big[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\big] - Q(S_t, A_t) \right]$$
where $\mathbb{E}\big[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\big] = \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$.
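Here's my sketch of that update, assuming the expectation is taken over an ε-greedy policy (my assumption; with a fully greedy target policy this would reduce to Q-Learning):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, eps):
    """Target uses the expectation of Q over the (assumed epsilon-greedy)
    policy's action probabilities in s_next."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)     # uniform exploration mass
    probs[int(np.argmax(Q[s_next]))] += 1.0 - eps   # extra mass on the greedy action
    expected_q = float(probs @ Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```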
Temporal difference:
The current state-value estimate $V(S_t)$ is updated using the observed reward $R_{t+1}$ and the estimated value $V(S_{t+1})$ at time point $t+1$:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
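Sketched the same way, a single TD(0) prediction step on a tabular `V[state]` array (again my own illustrative setup):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0) prediction: move V(s) towards the one-step bootstrapped target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```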
Is what I have correct, or am I missing something? And what about Double Q-Learning?
With probability 0.5:
$$Q_1(S, A) \leftarrow Q_1(S, A) + \alpha \left[ R + \gamma\, Q_2\big(S', \arg\max_a Q_1(S', a)\big) - Q_1(S, A) \right]$$
else:
$$Q_2(S, A) \leftarrow Q_2(S, A) + \alpha \left[ R + \gamma\, Q_1\big(S', \arg\max_a Q_2(S', a)\big) - Q_2(S, A) \right]$$
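My attempt at a sketch of one Double Q-Learning step, assuming two tabular arrays `Q1` and `Q2`:

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    """Flip a fair coin: one table picks the argmax action in s_next,
    the other table supplies that action's value for the target."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```

From what I've read, decoupling the action selection from its evaluation is meant to reduce the maximization bias of plain Q-Learning, but that's exactly the part I'd like confirmed.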
Can someone explain it, please?