Is clipnorm applied before or after momentum in keras?


In Keras/TensorFlow, clipnorm rescales a "gradient" whose norm exceeds a threshold down to that norm, while clipvalue bounds each individual component of the "gradient".
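To make the distinction concrete, here is a small numpy sketch of the two clipping modes as I understand them (my own helper functions, not the Keras internals):

```python
import numpy as np

def clipnorm(g, c):
    """Rescale the whole vector g so its norm is at most c."""
    n = np.linalg.norm(g)
    return g * (c / n) if n > c else g

def clipvalue(g, c):
    """Bound each component of g to the interval [-c, c]."""
    return np.clip(g, -c, c)

g = np.array([3.0, 4.0])        # norm 5
print(clipnorm(g, 1.0))         # [0.6, 0.8] -- direction preserved
print(clipvalue(g, 1.0))        # [1.0, 1.0] -- direction changed
```

Note that clipnorm preserves the direction of the gradient and only shrinks its length, while clipvalue can change the direction.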

But what happens if you combine one of them with momentum or an optimizer like Adam? Is the clipping applied to the gradient or to the velocity?

A) Is clipnorm applied to the actual mathematical gradient g of the loss with respect to the parameters, and this clipped gradient then used to compute the update step together with the momentum of the old gradients and the learning rate?

velocity = momentum * velocity - learning_rate * clipnorm(g)
w = w + velocity

or

B) First the momentum of the old gradients is combined with the unmodified new gradient. Then the resulting vector (the "velocity") gets scaled by clipnorm.

velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + velocity

or B')

velocity = momentum * velocity - learning_rate * g
w = w + clipnorm(velocity)

or there would also be the possibility of A')

velocity = momentum * velocity - clipnorm(learning_rate * g)
w = w + velocity

?

A (and A') would suffer from the problem that, even though the norm of the gradient is bounded, the velocity could still grow arbitrarily large due to momentum, and the clipping would make it even slower to wind down the velocity or change its direction.
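This growth can be demonstrated numerically. The sketch below implements option A for plain SGD with momentum (the clipnorm helper and the constant gradient are illustrative assumptions): even with the per-step gradient clipped to norm 1, the velocity accumulates toward lr * clip / (1 - momentum).

```python
import numpy as np

clip = 1.0          # hypothetical clipnorm value
lr = 1.0
momentum = 0.9

def clipnorm(g, c):
    n = np.linalg.norm(g)
    return g * (c / n) if n > c else g

g = np.array([100.0])    # a consistently large gradient
v = np.zeros(1)
for _ in range(50):
    # option A: clip the raw gradient, then apply momentum
    v = momentum * v - lr * clipnorm(g, clip)

# geometric series: |v| approaches lr * clip / (1 - momentum) = 10,
# i.e. ten times the clip threshold
print(abs(v[0]))
```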

From my perspective B would be the most reasonable, but I don't know how it is actually implemented.

The same question can be asked analogously for clipvalue, and for Adam and other momentum-based optimizers.

PS: If clipnorm is not implemented as suggested in B, I would be interested to know whether there is also a way to get B or B' in Keras by using a different option.
