PPO algorithm converges on only one action


I have taken some reference implementations of the PPO algorithm and am trying to create an agent that can play Space Invaders. Unfortunately, from the 2nd trial onwards (after training the actor and critic neural networks for the first time), the probability distribution of the actions converges on only one action, and the PPO loss and the critic loss converge on a single value.

I wanted to understand the probable reasons why this might occur. I can't really run the code on my cloud VMs without being sure that I am not missing anything, as the VMs are very costly to use. I would appreciate any help or advice in this regard; if required, I can post the code as well. The hyperparameters used are as follows:

clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001
gamma = 0.99
lambda = 0.95
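
For context, these coefficients enter the training loss roughly as in the sketch below (a rough NumPy sketch of the standard PPO clipped objective, not my actual training code; gamma and lambda are only used for the GAE advantage estimation and do not appear here):

import numpy as np

def ppo_loss(new_log_probs, old_log_probs, advantages, returns, values, entropy,
             clipping_val=0.2, critic_discount=0.5, entropy_beta=0.001):
    # probability ratio between the new and old policies
    ratio = np.exp(new_log_probs - old_log_probs)
    # clipped surrogate objective for the actor
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clipping_val, 1 + clipping_val) * advantages
    actor_loss = -np.mean(np.minimum(unclipped, clipped))
    # squared-error value loss for the critic, weighted by critic_discount
    critic_loss = np.mean((returns - values) ** 2)
    # entropy bonus encourages exploration; a very small entropy_beta can let
    # the policy collapse onto a single action
    return actor_loss + critic_discount * critic_loss - entropy_beta * np.mean(entropy)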


1 Answer


One of the reasons could be that you are not normalising the inputs to the CNN into the range [0, 1] and are thus saturating your neural networks. I suggest adding a preprocess() function like the one below to your code to transform your states (inputs) before feeding them to the network.

import cv2

def preprocess(self, img):
    # downscale the raw Atari frame to 80x105 (width x height)
    resized = cv2.resize(img, (80, 105))  # interpolation=cv2.INTER_AREA
    # convert to grayscale first (cvtColor expects an 8-bit image)
    resized = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    # scale all pixel values into the [0, 1] range
    resized = resized / 255.0
    # add a trailing channel dimension for the CNN input
    resized = resized.reshape(resized.shape + (1,))
    return resized
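
As a quick sanity check you can confirm the shape and value range of a preprocessed frame; the sketch below assumes a raw 210x160x3 Atari frame and that agent is whatever object defines preprocess() (both names are just for illustration):

import numpy as np

frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)  # fake raw frame
state = agent.preprocess(frame)   # 'agent' stands in for the object holding preprocess()
print(state.shape)                # (105, 80, 1)
print(state.min(), state.max())   # both should lie inside [0.0, 1.0]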