Implementing Q-Value Iteration from scratch


I am taking an online course where I had to submit many Q-values, so I wrote a Python script to calculate them automatically using the Q-value iteration update rule. But the script is not performing as it should: it gives wrong answers, even though I get the right answers when I do the same calculation on paper.
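
For reference, the update rule in the original post is embedded as an image; it is presumably the standard Q-value iteration update, which I am reconstructing here:

    Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]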

def Qvalue_iteration(T, R, gamma=0.5, n_iters=10):
    nS = R.shape[0]
    nA = T.shape[0]
    Q = [[0]*nA]*nS # initially
    for _ in range(n_iters):
        for s in range(nS): # for all states s
            for a in range(nA): # for all actions a
                sum_sp = 0
                for s_ in range(nS): # for all reachable states s'
                    sum_sp += (T[a][s][s_]*(R[s][s_][a] + gamma*max(Q[s_])))
                Q[s][a] = sum_sp
    return Q

Here, T is the transition-probability array and R is the reward array. Could anyone please help me code this Q-value iteration algorithm from scratch? I am a beginner in reinforcement learning. I have already submitted the answers and got them all correct by working them out on paper, but I would like to get the code working as well.
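
Spelling out the shapes implied by the indexing above (T[a][s][s'] and R[s][s'][a]), the call would look something like this; the sizes are made up purely for illustration:

import numpy as np

nS, nA = 3, 2                 # example sizes, chosen arbitrarily
T = np.zeros((nA, nS, nS))    # T[a][s][s'] : transition probabilities
R = np.zeros((nS, nS, nA))    # R[s][s'][a] : rewards
Q = Qvalue_iteration(T, R, gamma=0.5, n_iters=10)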


There are 2 solutions below.


@Ashutosh Have you solved the problem already? Which online course was it for?

import numpy as np

def Qvalue_iteration(T, R, gamma=0.5, n_iters=10):
    # assumes T and R are numpy arrays of shape (nS, nA, nS),
    # indexed as T[s][a][s'] and R[s][a][s']
    nS = T.shape[0]
    nA = T.shape[1]
    Q = np.zeros((nS, nA)) # initially all zeros
    for _ in range(n_iters):
        for s in range(nS): # for all states s
            for a in range(nA): # for all actions a
                sum_sp = 0
                for s_ in range(nS): # for all reachable states s'
                    sum_sp += T[s][a][s_] * (R[s][a][s_] + gamma * max(Q[s_]))
                Q[s][a] = sum_sp
    return Q

Note that the indexing order has been changed: both T and R are indexed as [s][a][s'], so their shapes must be (nS, nA, nS).
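
As a quick check of the re-ordered version, it can be run on a small made-up MDP; the numbers below are purely illustrative, with T and R shaped (nS, nA, nS) as assumed above:

import numpy as np

nS, nA = 2, 2
# T[s][a][s'] : probability of moving to s' when taking action a in state s
T = np.array([[[0.9, 0.1],    # state 0, action 0
               [0.5, 0.5]],   # state 0, action 1
              [[0.2, 0.8],    # state 1, action 0
               [0.0, 1.0]]])  # state 1, action 1
# R[s][a][s'] : reward for the transition s -> s' under action a
R = np.zeros((nS, nA, nS))
R[0, 0, 1] = 1.0   # reaching state 1 from state 0 via action 0
R[1, 1, 1] = 5.0   # staying in state 1 via action 1

Q = Qvalue_iteration(T, R, gamma=0.5, n_iters=10)
print(Q)           # row s gives Q(s, a) for each action a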


The line that is causing the error is the one where you initialise the Q matrix:

Q = [[0]*nA]*nS # initially

The expression [[0]*nA]*nS does not create nS independent rows; it creates nS references to the same inner list, so assigning to Q[s][a] changes that entry in every row at once. Instead, you could import numpy as np and initialise the matrix of zeros with:

Q = np.zeros((nS, nA))
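
A quick way to see the aliasing problem with the original initialisation (a minimal illustration, separate from the course code):

Q = [[0]*2]*3   # three references to the same inner list
Q[0][0] = 5
print(Q)        # [[5, 0], [5, 0], [5, 0]] -- every row changed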