State value and state-action values with a policy - Bellman equation with a policy


I am just getting started with deep reinforcement learning and I am trying to grasp this concept.

I have this deterministic Bellman equation:

[image: deterministic Bellman equation]
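In symbols, what I have in mind is roughly this (my own notation: r(s_t, a_t) is the immediate reward, γ the discount factor, and s_{t+1} the next state that follows deterministically from s_t and a_t):

$$V(s_{t}) = \max_{a_{t}} \left[ r(s_{t}, a_{t}) + \gamma \, V(s_{t+1}) \right]$$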

When I introduce the stochasticity of the MDP, I get equation 2.6a:

[image: the deterministic Bellman equation with MDP transition probabilities (2.6a)]
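Roughly, 2.6a weights the next-state values by the MDP's transition probabilities p(s_{t+1} | s_t, a_t), something like:

$$V(s_{t}) = \max_{a_{t}} \left[ r(s_{t}, a_{t}) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_{t}, a_{t}) \, V(s_{t+1}) \right]$$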

Is this assumption of mine correct? I have seen implementation 2.6a written without a policy sign on the state-value function, but to me that does not make sense, because I am using the probabilities of the different next states I could end up in, which seems to be the same as having a policy. And if 2.6a is correct, can I then assume the same for the rest (2.6b and 2.6c)? Because then I would like to write the state-action value function like this:

[image: state-action value function with a policy]
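Roughly in this spirit, with an explicit policy sign π on both value functions (this is only my sketch of the idea, not necessarily the exact form of 2.6c):

$$Q^{\pi}(s_{t}, a_{t}) = r(s_{t}, a_{t}) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_{t}, a_{t}) \, V^{\pi}(s_{t+1})$$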

The reason I am doing it like this is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.

I hope someone out there can help me with this one!

Best regards, Søren Koch


2 Answers

Accepted answer

No, the value function V(s_t) does not depend on the policy. You can see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions selected by any policy.

In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state values, but this is still independent of any policy. The sum only runs over the different possible future states, and every term involves exactly the same (policy-independent) action a_t. The only reason you have these probabilities is that, in the nondeterministic case, a specific action in a specific state can lead to one of multiple possible next states. This is not due to a policy, but due to stochasticity in the environment itself.
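To make this concrete, here is a minimal sketch of such a policy-free backup in Python (the data layout, with reward[s][a] for the immediate reward and transition[s][a] as a dict mapping next states to probabilities, is just an assumption for illustration):

    GAMMA = 0.9  # assumed discount factor

    def optimality_backup(s, actions, reward, transition, V):
        # One Bellman optimality backup for state s.
        # V is the current estimate of the state values. No policy appears anywhere here.
        best = float("-inf")
        for a in actions:
            # The sum runs over the possible next states; the action a is the same
            # in every term, so the probabilities come from the environment,
            # not from a policy.
            expected_next = sum(p * V[s_next] for s_next, p in transition[s][a].items())
            best = max(best, reward[s][a] + GAMMA * expected_next)
        return best

Only when you define a policy-dependent value function, as described below, does a policy enter the computation at all.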


There does also exist such a thing as a value function for a policy, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent value function would replace the max_{a_t} with a sum over all actions a, where each term inside the sum is weighted by the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
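Written out, such a policy-dependent value function would look roughly like this (same pi(s_t, a) notation as above, and p(s_{t+1} | s_t, a) for the environment's transition probability):

$$V^{\pi}(s_{t}) = \sum_{a} \pi(s_{t}, a) \left[ r(s_{t}, a) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_{t}, a) \, V^{\pi}(s_{t+1}) \right]$$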

Another answer

Yes, your assumption is completely right. In the reinforcement learning field, a value function is the (expected) return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy sign π.

The Bellman equation basically expresses value functions recursively. However, it should be noted that there are two kinds of Bellman equations:

  • Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation contains the non-linear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk, as sketched after this list. Some short texts or papers omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the Sutton & Barto or Busoniu et al. books.

  • Bellman equation, which characterizes a value function, in this case associated with any given policy π (also sketched below).
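In standard notation, the two kinds of equations look roughly like this, where the asterisk marks the optimal value function and s' the next state (the exact symbols vary between books):

$$V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \, V^{*}(s') \right]$$

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma \, V^{\pi}(s') \right]$$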

In your case, your equation 2.6 is based on the Bellman equation, so it should remove the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry for the notation change with respect to your question, but I think it's understandable):
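In their notation, with π(a|s) the probability that the policy picks action a in state s and p(s', r | s, a) the joint probability of the next state and reward, the Bellman equation for the state-value function reads roughly:

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_{\pi}(s') \right]$$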