I keep thinking that I am about to understand custom gradients, but then I test out an example like the one below and I just cannot figure out what is going on. I am hoping somebody can walk me through exactly what is happening. I think this comes down to me not understanding what exactly `dy` is in the backward function.
```python
v = tf.Variable(2.0)
with tf.GradientTape() as t:
    x = v * v
    output = x**2
print(t.gradient(output, v))
```

**tf.Tensor(32.0, shape=(), dtype=float32)**
Everything is good here and the gradient is as one would expect. I then test this example using custom gradients, which (given my understanding) could not possibly affect the gradient, since I have passed an enormous threshold to `clip_by_norm`:
```python
@tf.custom_gradient
def clip_gradients2(y):
    def backward(dy):
        return tf.clip_by_norm(dy, 20000000000000000000000000)
    return y**2, backward

v = tf.Variable(2.0)
with tf.GradientTape() as t:
    x = v * v
    output = clip_gradients2(x)
print(t.gradient(output, v))
```

**tf.Tensor(4.0, shape=(), dtype=float32)**
But the gradient is reduced to 4, so the custom gradient is somehow having an effect. How exactly does this result in a smaller gradient?
When writing a custom gradient, you must define the whole derivative calculation of that step yourself. Without your custom gradient, TensorFlow differentiates through both operations with the chain rule:

d(output)/dv = d(x^2)/dx * dx/dv = 2x * 2v = 2(4) * 2(2) = 32

When you override the gradient calculation, your `backward` function completely replaces the local derivative of `y**2`. It receives the upstream gradient `dy` (here 1.0, the seed gradient of the scalar output) and returns it essentially unchanged, because the clip threshold is astronomically larger than the gradient's norm. So TensorFlow only applies the remaining factor:

d(output)/dv = dy * dx/dv = 1 * 2v = 4

You need to calculate the derivative in your function yourself, i.e. return `tf.clip_by_norm(dy * 2 * y, ...)` from `backward`, to get the desired behavior.
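A corrected version of the question's function (a sketch; it keeps the original huge clip threshold, which leaves the gradient unchanged, and adds the missing local derivative `2*y`):

```python
import tensorflow as tf

@tf.custom_gradient
def clip_gradients2(y):
    def backward(dy):
        # Multiply the upstream gradient dy by the local derivative
        # d(y**2)/dy = 2*y, then clip the result.
        return tf.clip_by_norm(dy * 2 * y, 20000000000000000000000000)
    return y**2, backward

v = tf.Variable(2.0)
with tf.GradientTape() as t:
    x = v * v
    output = clip_gradients2(x)
grad = t.gradient(output, v)
print(grad)  # tf.Tensor(32.0, shape=(), dtype=float32)
```

With `v = 2.0`, `backward` receives `dy = 1.0` and returns `1.0 * 2 * 4.0 = 8.0`, which the tape then multiplies by `dx/dv = 2v = 4`, recovering the expected gradient of 32.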