Why is the bandit problem also called a one-step/one-state MDP in reinforcement learning?


What do we mean by a one-step/one-state MDP (Markov decision process)?


There are 2 best solutions below

0
On BEST ANSWER

Consider an MDP with n actions and a single state. Regardless of which action you take, you stay in the same state. You do, however, receive a reward that depends only on the action you took. To maximise the long-term reward in this setting, all you need to do is work out which of the n available actions is best.

This is exactly what the bandit problem is.
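
To make the reduction concrete, here is a rough sketch of how the Bellman optimality equations collapse when there is only one state s. The symbols are assumptions for illustration and do not appear in the answer above: r(a) is the expected reward of action a, and gamma in [0, 1) is a discount factor for an infinite-horizon return.

```latex
\begin{align*}
Q(s,a) &= r(a) + \gamma V(s), \\
V(s)   &= \max_a Q(s,a) = \max_a r(a) + \gamma V(s), \\
V(s)   &= \frac{\max_a r(a)}{1-\gamma}, \qquad a^{*} = \arg\max_a r(a).
\end{align*}
```

Under these assumptions the term \gamma V(s) is the same for every action, so it drops out of the argmax, and the optimal policy is simply to pick the action with the highest expected reward.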

0
On

In a bandit problem, past lever pulls do not affect what a lever will pay out.

The reward depends only on which lever is pulled, not on anything that happened in the past.

So there is only one state.
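
As an illustration of the single-state view, here is a minimal Python sketch. Everything in it is a hypothetical example rather than part of the answers above: the `OneStateBandit` class, the arm means, the Gaussian reward noise, and the epsilon-greedy learner with sample-average estimates are all assumptions chosen for brevity.

```python
import random


class OneStateBandit:
    """Hypothetical single-state MDP: every action leads back to state 0."""

    def __init__(self, arm_means, seed=0):
        self.arm_means = list(arm_means)
        self.rng = random.Random(seed)
        self.state = 0  # the only state

    def step(self, action):
        # The reward depends only on the chosen action, never on past pulls.
        # Unit-variance Gaussian noise is an assumption for this sketch.
        reward = self.rng.gauss(self.arm_means[action], 1.0)
        return self.state, reward  # the next state is always 0


def estimate_best_arm(env, n_arms, n_steps=5000, epsilon=0.1, seed=1):
    """Epsilon-greedy with sample-average value estimates, one per action."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)  # explore
        else:
            action = max(range(n_arms), key=lambda a: values[a])  # exploit
        _, reward = env.step(action)
        counts[action] += 1
        # Incremental update of the running mean for this action.
        values[action] += (reward - values[action]) / counts[action]
    best = max(range(n_arms), key=lambda a: values[a])
    return best, values


env = OneStateBandit([0.2, 0.5, 1.5])
best, values = estimate_best_arm(env, n_arms=3)
print("estimated best arm:", best)
print("value estimates:", [round(v, 2) for v in values])
```

Because the state never changes, the learner only needs one value estimate per action, with no per-state table and no bootstrapping from a next-state value.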