I have taken some reference implementations of the PPO algorithm and am trying to create an agent that can play Space Invaders. Unfortunately, from the 2nd trial onwards (after training the actor and critic networks for the first time), the probability distribution of the actions collapses onto a single action, and both the PPO loss and the critic loss converge to a single value.
I wanted to understand the probable reasons why this might occur. I can't really run the code on my cloud VMs without being sure that I am not missing anything, as the VMs are very costly to use. I would appreciate any help or advice in this regard; if required, I can post the code as well. The hyperparameters used are as follows:
clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001
gamma = 0.99
lambda = 0.95
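For context, here is a simplified sketch (in PyTorch, not my exact code) of how these hyperparameters typically enter the PPO objective; all tensor names are placeholders:

```python
import torch

def gae_advantages(rewards, values, masks, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation: gamma discounts rewards, lam (lambda)
    # trades off bias vs. variance. `values` holds one extra bootstrap entry.
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        advantages[t] = gae
    return advantages

def ppo_loss(new_log_probs, old_log_probs, advantages, returns, values, entropy,
             clipping_val=0.2, critic_discount=0.5, entropy_beta=0.001):
    # Probability ratio between the updated policy and the old policy
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective (PPO-clip)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clipping_val, 1 + clipping_val) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()
    # Critic is regressed towards the returns, weighted by critic_discount
    critic_loss = (returns - values).pow(2).mean()
    # The entropy bonus discourages premature collapse onto one action;
    # a small entropy_beta such as 0.001 may be too weak to prevent that
    return actor_loss + critic_discount * critic_loss - entropy_beta * entropy.mean()
```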
One of the reasons could be that you are not normalising the inputs to the CNN to the range [0, 1], and are thus saturating your neural networks. I suggest you use the preprocess() function in your code to transform your states (the inputs to the network).
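If you don't already have such a function, a minimal version for Atari frames might look like this (a sketch assuming raw 210x160 RGB frames from Gym; adapt the cropping/resizing to your own pipeline):

```python
import numpy as np

def preprocess(frame):
    # frame: raw Atari RGB frame, uint8, shape (210, 160, 3), values in [0, 255]
    gray = frame.mean(axis=2)                        # collapse RGB to grayscale
    downsampled = gray[::2, ::2]                     # (105, 80) via striding
    return (downsampled / 255.0).astype(np.float32)  # scale to [0, 1]
```

The key point is the division by 255, so the network never sees raw pixel intensities.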