Application of Welford algorithm to PPO agent training

17 Views Asked by Ftoso91 At 20 March 2024 at 11:22

A state-of-the-art approach to normalizing observations during PPO agent training is to use running statistics (mean and standard deviation) that are continuously updated during training using the Welford algorithm.
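
For concreteness, a minimal sketch of running statistics maintained with Welford's update could look like the following. The class name `RunningStats`, the `eps` constant, and the choice of the population variance are illustrative assumptions, not taken from any particular PPO library.

```python
import math


class RunningStats:
    """Running mean/std of a vector observation via Welford's update (illustrative)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps  # assumed small constant to avoid division by zero

    def update(self, obs):
        """Fold one observation into the statistics in O(dim) time."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            # Uses the already-updated mean, as in Welford's recurrence.
            self.m2[i] += delta * (x - self.mean[i])

    def std(self):
        """Population standard deviation; falls back to 1.0 with fewer than 2 samples."""
        if self.count < 2:
            return [1.0] * len(self.mean)
        return [math.sqrt(m / self.count) for m in self.m2]

    def normalize(self, obs):
        return [(x - m) / (s + self.eps)
                for x, m, s in zip(obs, self.mean, self.std())]


stats = RunningStats(dim=2)
for obs in [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]:
    stats.update(obs)
```

Nothing in this sketch decides when `update` is called relative to `normalize`, which is exactly the open point of the question.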

However, I have not found any reference on the specific details of the update frequency. In particular:

Should the running statistics be reset every time a new episode begins? (I would say no, but I'm not sure.)
Within a single training episode, should I normalize all the observations using the same mean and standard deviation? (This basically means using "target" normalizing statistics that are updated on a per-episode basis, not on a per-sample basis.)
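
To make the second question concrete, the two update schedules can be sketched as below for a scalar observation. The helper names (`Welford`, `per_sample`, `per_episode`) are hypothetical, and both variants assume the statistics are carried across episodes rather than reset. The sketch only illustrates the difference and does not settle which schedule is standard.

```python
import math


class Welford:
    """Scalar running mean/variance (illustrative helper)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def snapshot(self):
        """Return (mean, std); std defaults to 1.0 until two samples are seen."""
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return self.mean, std


def normalize(x, mean, std, eps=1e-8):
    return (x - mean) / (std + eps)


def per_sample(episodes):
    """Variant A: update on every observation, then normalize with the fresh statistics."""
    stats, out = Welford(), []
    for episode in episodes:
        for x in episode:
            stats.update(x)
            out.append(normalize(x, *stats.snapshot()))
    return out


def per_episode(episodes):
    """Variant B: freeze the statistics for the episode, fold its samples in afterwards."""
    stats, out = Welford(), []
    for episode in episodes:
        mean, std = stats.snapshot()  # fixed "target" statistics for this episode
        for x in episode:
            out.append(normalize(x, mean, std))
        for x in episode:
            stats.update(x)
    return out


episodes = [[1.0, 2.0, 3.0], [4.0, 5.0]]
a = per_sample(episodes)
b = per_episode(episodes)
```

With the frozen variant, every observation in an episode is scaled by the same mean and standard deviation, while the per-sample variant scales each observation with slightly different statistics.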

I'm not asking which variant is the most efficient (that is probably difficult to tell in general), but rather which variant is considered state of the art when applied to PPO training.
