In Keras or TensorFlow, clipnorm rescales large gradients to have a specific norm, and clipvalue bounds all the individual values of the gradient. But what happens if you combine one of them with momentum or something like Adam? Is the clipping applied to the gradient or rather to the velocity?
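Just to pin down what I mean by the two options, here is a small sketch using the standalone TensorFlow ops, which I assume do the same rescaling/clamping as the optimizer arguments (where in the update they are applied is exactly my question):

import tensorflow as tf

g = tf.constant([3.0, -4.0])                # ||g|| = 5

# clipnorm-style: rescale the whole tensor so its L2 norm is at most 1
print(tf.clip_by_norm(g, clip_norm=1.0))    # [ 0.6 -0.8], direction preserved

# clipvalue-style: clamp every component into [-1, 1] independently
print(tf.clip_by_value(g, -1.0, 1.0))       # [ 1. -1.], direction can change

# the corresponding optimizer arguments are passed like this
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)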
A) Is clipnorm applied to the actual mathematical gradient g of the loss with respect to the parameters, and is this clipped gradient then used to compute the update step together with the momentum of the old gradients and the learning rate?
velocity = momentum * velocity - learning_rate * clipnorm(g)
w = w + velocity
or
B) First the momentum of the old gradients is combined with the unmodified new gradient. Then the resulting vector (the "velocity") gets scaled by clipnorm.
velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + velocity
or B')
velocity = momentum * velocity - learning_rate * g
w = w + clipnorm(velocity)
or there would also be the possibility of A')
velocity = momentum * velocity - clipnorm(learning_rate * g)
w = w + velocity
?
A (and A') would suffer from the problem that, even though the norm of each clipped gradient is bounded, the velocity can still build up to something much larger than that bound (roughly learning_rate * clipnorm / (1 - momentum)) due to momentum, and the clipping would make it even slower to shrink the velocity again or to change its direction.
From my perspective, B would be the most reasonable, but I don't know how it is actually implemented.
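To illustrate that concern with a toy example (just a 1-D simulation of the pseudocode above, not how Keras actually does it): with a constant, badly scaled gradient, variant A lets the velocity build up to about learning_rate * clipnorm / (1 - momentum), while variant B keeps it at the clipping threshold.

import numpy as np

def clip_norm(x, c):
    # rescale x so that its L2 norm is at most c
    n = np.linalg.norm(x)
    return x if n <= c else x * (c / n)

momentum, lr, c = 0.9, 1.0, 1.0
g = np.array([10.0])                            # constant, very large gradient

v_a = np.zeros(1)                               # variant A: clip the gradient
v_b = np.zeros(1)                               # variant B: clip the velocity
for _ in range(100):
    v_a = momentum * v_a - lr * clip_norm(g, c)
    v_b = clip_norm(momentum * v_b - lr * g, c)

print(abs(v_a[0]))   # ~10.0 == lr * c / (1 - momentum)
print(abs(v_b[0]))   # 1.0   == the clipping threshold c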
The same question can be asked analogously for clipvalue, and for Adam and other momentum-based algorithms.
PS: If clipnorm is not implemented as suggested in B, I would be interested whether there is also a way to get B or B' in Keras by using a different option. A by-hand sketch of what I mean by B' is below.
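This is not a claim about any built-in Keras option, just a sketch (with made-up names) of how B' could be forced in a custom training loop by keeping the velocity in explicit variables and clipping only the step that is applied to the weights:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build((None, 4))
loss_fn = tf.keras.losses.MeanSquaredError()

momentum, lr, clip = 0.9, 0.01, 1.0
velocities = [tf.Variable(tf.zeros_like(w)) for w in model.trainable_variables]

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for w, v, g in zip(model.trainable_variables, velocities, grads):
        v.assign(momentum * v - lr * g)            # unclipped velocity, as in B'
        w.assign_add(tf.clip_by_norm(v, clip))     # clip only the applied step
    return loss

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))
print(train_step(x, y))

For variant B one would instead assign the clipped vector back to the velocity variable itself.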