I am trying to plot the overestimation bias of the critics in DDPG and TD3 models. Both algorithms maintain a critic and a critic_target network. How does one measure the overestimation bias of the critic relative to the true Q value, and how is the true Q value itself estimated?
I see in the original TD3 paper (https://arxiv.org/pdf/1802.09477.pdf) that the authors measure the overestimation bias of the value networks. Can someone guide me in plotting the same quantity during the training phase of my actor-critic model?
Answering my own question: during training, at each evaluation interval (for example, every 5000 steps) we can call a function that estimates the bias as follows. The "true" Q value is approximated by a Monte Carlo rollout of the current policy (summing discounted rewards), and the bias is the difference between the critic's estimate and that return. Keep in mind the policy is kept fixed throughout this evaluation run.
The pseudocode is as follows:
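A minimal sketch of the idea in Python. Note that `estimate_bias`, the `env_reset`/`env_step` interface, and the `policy`/`critic` callables are hypothetical placeholders, not part of any library; in a real TD3/DDPG codebase you would plug in your environment and networks:

```python
import numpy as np

def estimate_bias(env_reset, env_step, policy, critic, gamma=0.99,
                  num_states=10, horizon=200):
    """Compare the critic's Q(s, a) against a Monte Carlo estimate of the
    true discounted return obtained by rolling out the (fixed) policy."""
    biases = []
    for _ in range(num_states):
        s = env_reset()                    # sample a start state
        a = policy(s)
        q_estimate = critic(s, a)          # critic's predicted value for (s, a)
        # Monte Carlo rollout: follow the fixed policy, sum discounted rewards
        true_return, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env_step(s, a)
            true_return += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)
        biases.append(q_estimate - true_return)
    # positive mean => critic overestimates, negative => underestimates
    return float(np.mean(biases))

# Toy usage with a deterministic dummy environment (reward 1 per step):
env_reset = lambda: 0
env_step = lambda s, a: (s, 1.0, False)
policy = lambda s: 0
critic = lambda s, a: 2.5                  # a deliberately inflated estimate
bias = estimate_bias(env_reset, env_step, policy, critic,
                     gamma=0.5, num_states=3, horizon=5)
```

Plotting the returned mean bias at every evaluation interval against the training step reproduces the kind of curve shown in the TD3 paper. Averaging over several start states (and, for stochastic environments, several rollouts per state) reduces the variance of the Monte Carlo estimate.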