Application of the Welford algorithm to PPO agent training


A state-of-the-art approach to normalizing observations during PPO agent training is to use running statistics (mean and standard deviation) that are updated continuously during training with the Welford algorithm.
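For concreteness, here is a minimal sketch of the kind of running statistics meant above, assuming fixed-size observation vectors and the population variance. The class name `RunningStat` and its methods are hypothetical and not taken from any particular library.

```python
import math


class RunningStat:
    """Welford running mean/variance per observation dimension (illustrative)."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean

    def update(self, obs):
        # One Welford step per incoming observation vector.
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def std(self):
        # Assumption: fall back to 1.0 until at least two samples were seen.
        if self.count < 2:
            return [1.0] * len(self.mean)
        return [math.sqrt(m / self.count) for m in self.m2]

    def normalize(self, obs, eps=1e-8):
        # eps guards against division by zero for constant dimensions.
        return [(x - m) / (s + eps) for x, m, s in zip(obs, self.mean, self.std())]
```

Calling `update` once per observation keeps the mean and variance current without storing past observations.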

However, I have not found any reference on the specific details of the update frequency. In particular:

  1. Should the running statistics be reset every time a new episode begins? (I would say no, but I'm not sure)

  2. Within a single training episode, should I normalize all observations using the same mean and standard deviation? (This basically means using "target" normalizing statistics that are updated on a per-episode basis rather than a per-sample basis.)
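To make the two options in point 2 concrete, here is a small sketch for scalar observations. The names `ScalarStat`, `per_sample`, and `per_episode` are hypothetical, and the sketch only illustrates the difference between the two schedules; it does not indicate which one is standard. In both cases the statistics object is kept across episodes, i.e. not reset.

```python
import math


class ScalarStat:
    """Welford running statistics for a single scalar (illustrative)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        # Assumption: use std = 1.0 until at least two samples were seen.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return (x - self.mean) / (std + eps)


def per_sample(stat, episode):
    # Variant A: update the statistics with every observation, then normalize it.
    out = []
    for x in episode:
        stat.update(x)
        out.append(stat.normalize(x))
    return out


def per_episode(stat, episode):
    # Variant B: normalize the whole episode with the statistics frozen at its
    # start, and fold the episode's observations in only afterwards.
    out = [stat.normalize(x) for x in episode]
    for x in episode:
        stat.update(x)
    return out
```

With variant B, every observation in an episode is scaled by the same mean and standard deviation; with variant A, the scaling drifts slightly from step to step.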

I'm not asking which variant is the most efficient (that is probably difficult to say in general), but rather which variant is considered state of the art when applied to PPO training.
