I am implementing a fairly complex Function for my research; this layer uses belief propagation. I have derived the gradient w.r.t. W (the parameter) of this layer, but because it is complex, I have not derived the gradient w.r.t. input_data (the data coming from the previous layer).
I am confused about the details of backpropagation. I have read a lot about the BP algorithm, and some notes say it is enough to differentiate only w.r.t. W (the parameter) and use the residual to get the gradient. Your example, however, seems to also compute the gradient w.r.t. the input data (the previous layer's output), which confuses me. A typical example: how do you derive the gradient w.r.t. the input image in a convolutional layer?
My network has two layers. Do I need to derive the gradient w.r.t. the input X by hand in the last layer? (Does backward need to return gx so that BP can make the gradient flow to the previous layer?)
If you do not need the gradient w.r.t. the input, you can omit its computation. In that case, return None as the placeholder for the omitted input gradient. Note that the grad of the input will then be incorrect after backprop. If you want to write a Function that can be used in any context (including the case where someone wants the gradient w.r.t. the input), you have to compute the gradients w.r.t. all of the inputs (except when the Function is not differentiated w.r.t. that input). This is why the built-in functions of Chainer compute gradients for all of their inputs.

By the way, deriving the gradient w.r.t. the input image of a convolutional layer is simple: apply a transposed convolution (called "deconvolution" in Chainer for historical reasons) to the gradient of the output, using the same weight.