In the context of Reinforcement Learning, specifically the Deep Q-Learning (DQN) algorithm,

the online network is trained by minimizing a loss between the Q-values it predicts and target Q-values computed from the target network.
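To make what I mean concrete, here is a minimal PyTorch-style sketch of that training step (names like `policy_net` and `target_net` are just illustrative, and the batch layout is an assumption on my part):

```python
import torch
import torch.nn as nn

def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    # batch is assumed to be tensors: states [B, obs_dim], actions [B] (int64),
    # rewards [B], next_states [B, obs_dim], dones [B] (0.0 or 1.0).
    states, actions, rewards, next_states, dones = batch

    # Q-values predicted by the online network for the actions actually taken.
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q-values: bootstrap from the (frozen) target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    # Loss between online predictions and target-network targets (e.g. MSE).
    return nn.functional.mse_loss(q_pred, q_target)
```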

I'm confused about the stability of this training process, given that the target network starts out as an exact copy of the online network's randomly initialized weights.
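This is the initialization I'm referring to, again as a purely illustrative PyTorch sketch (the network architecture and sync schedule here are my own placeholders):

```python
import copy
import torch.nn as nn

# The online network starts with random weights...
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# ...and the target network begins as an identical copy of those random weights.
target_net = copy.deepcopy(policy_net)

# Later, inside the training loop, the copy is refreshed periodically, e.g.:
# if step % sync_every == 0:
#     target_net.load_state_dict(policy_net.state_dict())
```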

How can training converge when these arbitrary initial weights produce target Q-values that carry no meaningful information?

In other words, how can the training process be stable when, at the start, it is driven entirely by targets that are not meaningful?
