Is this true? What about Expected SARSA and Double Q-Learning?


I'm studying Reinforcement Learning and I'm having trouble understanding the difference between SARSA, Q-Learning, Expected SARSA, Double Q-Learning, and temporal difference learning. Can you please explain the differences and tell me when to use each? And how does each one interact with ε-greedy versus greedy action selection?

SARSA:

I'm in state St; an action is chosen with the help of the policy, which moves me to the next state St+1. The policy then chooses an action in St+1 as well, and the value of (St, At) is updated toward the reward plus the discounted value of that look-ahead state-action pair:

Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]
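
To check that I have the mechanics right, here is a minimal tabular sketch of how I picture this update (the NumPy Q-table layout and the helper names are my own assumptions, not from any library):

```python
import numpy as np

# Assumed layout: Q is a NumPy array of shape (n_states, n_actions).

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses A', the action the behaviour policy
    actually chose in S' (e.g. via epsilon_greedy above)."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```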

Q-Learning:

I'm in state St; an action is chosen with the help of the policy, which moves me to state St+1. This time the target does not depend on what the policy does next: instead it takes the maximum estimated value over actions in St+1 (the greedy choice), and that is what updates the value of (St, At):

Q(S, A) ← Q(S, A) + α[R + γ maxₐ Q(S′, a) − Q(S, A)]
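
In the same assumed tabular setting, the only change from my SARSA sketch above would be replacing the sampled Q[s_next, a_next] with the max, which is why I understand Q-Learning to be off-policy:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target always takes the max over actions in S',
    regardless of which action the behaviour policy plays next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```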

Expected SARSA:

It has the same shape as Q-Learning, but instead of updating toward the value of the greedy action in St+1, I take the expected value of Q over all actions, weighted by how likely each action is under the current policy:

Q(St, At) ← Q(St, At) + α[Rt+1 + γ E[Q(St+1, At+1) | St+1] − Q(St, At)]

where E[Q(St+1, At+1) | St+1] = Σₐ π(a|St+1) Q(St+1, a), the expectation over actions under the current policy π.
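
If the policy is ε-greedy, I believe the expectation can be computed explicitly; here is a sketch under that assumption (same assumed Q-table layout as above):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """The target averages Q(S', .) under the epsilon-greedy policy
    instead of sampling A' (SARSA) or taking the max (Q-Learning)."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)    # exploration mass
    probs[int(np.argmax(Q[s_next]))] += 1.0 - epsilon  # greedy mass
    expected_q = float(np.dot(probs, Q[s_next]))
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```

With ε = 0 this target reduces to the Q-Learning max, which is how I'm making sense of the relationship between the two.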

Temporal difference:

This is TD(0) prediction: it updates a state-value estimate, not a reward. The current estimate V(St) is moved toward the observed reward Rt+1 plus the discounted estimate V(St+1) at time step t+1:

V(St) ← V(St) + α[Rt+1 + γ V(St+1) − V(St)]
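
In the same sketch style, assuming V is a 1-D array of state values:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0) prediction: there is no action in the target,
    so this evaluates a fixed policy rather than learning Q."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```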

Is what I have correct, or am I missing something? And what about Double Q-Learning?

With probability 0.5:

Q1(S, A) ← Q1(S, A) + α[R + γ Q2(S′, argmaxₐ Q1(S′, a)) − Q1(S, A)]

else:

Q2(S, A) ← Q2(S, A) + α[R + γ Q1(S′, argmaxₐ Q2(S′, a)) − Q2(S, A)]
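
My reading of these two updates, again as an assumed tabular sketch: one table selects the argmax and the other evaluates it, with the roles swapped at random, which is supposed to reduce the maximization bias of plain Q-Learning. Is this right?

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    """One table selects the greedy action, the other evaluates it;
    the roles are swapped with probability 0.5."""
    if rng.random() < 0.5:
        best = int(np.argmax(Q1[s_next]))  # Q1 selects ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, best] - Q1[s, a])  # ... Q2 evaluates
    else:
        best = int(np.argmax(Q2[s_next]))  # Q2 selects ...
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, best] - Q2[s, a])  # ... Q1 evaluates
```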

Can someone please explain?
