I've been playing around with automatic gradients in TensorFlow and I had a question. If we are using an optimizer, say Adam, when is the momentum algorithm applied to the gradient? Is it applied when we call tape.gradient(loss, model.trainable_variables) or when we call model.optimizer.apply_gradients(zip(dtf_network, model.trainable_variables))?
Thanks!
tape.gradient computes the gradients straightforwardly, without reference to an optimizer. Since momentum is part of the optimizer, the tape does not include it. AFAIK momentum is usually implemented by adding extra variables in the optimizer that store the running averages. All of this is handled in optimizer.apply_gradients.
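
To make the split concrete, here is a minimal sketch of a custom training step (the toy model, data, and loss are made up for illustration): tape.gradient hands back the plain gradients, and apply_gradients is where Adam's moment estimates are updated and used.

```python
import tensorflow as tf

# Toy model, optimizer, and data -- purely illustrative.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    pred = model(x, training=True)
    loss = loss_fn(y, pred)

# Raw dL/dw gradients: no momentum, no running averages,
# no learning rate applied yet.
grads = tape.gradient(loss, model.trainable_variables)

# Here the optimizer updates its internal slot variables (the first-
# and second-moment running averages) and uses them, together with
# the learning rate, to update the weights.
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```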