Could someone give a clear explanation of backpropagation for LSTM RNNs? This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error between the hypothesis and the output, used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different from that in regular neural networks.
I am unsure of how to find the initial error of each gate. Do you use the first error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs, if it does at all. I have looked thoroughly for a good source on LSTMs but have yet to find any.
That's a good question. You should certainly take a look at the suggested posts for details, but a complete example here would be helpful too.
RNN Backpropagation
I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in the RNN into a feed-forward sequence (like in the picture above). Note that an abstract RNN is unbounded (it can be arbitrarily long), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network, with a few complications, e.g. the weights in different layers are shared.
Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs: h[t] (the state which is fed into the next cell) and y[t] (the output at this step), by the following formulas, where Wxh, Whh and Why are the shared parameters:

h[t] = tanh(Wxh · x[t] + Whh · h[t-1] + bh)
y[t] = Why · h[t] + by

In the code, it's simply three matrices and two bias vectors:
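A sketch in the spirit of min-char-rnn.py (the names hidden_size and vocab_size are assumed to be defined elsewhere):

```python
import numpy as np

# shared model parameters, reused by every cell of the unrolled network
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((vocab_size, 1))   # output bias
```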
The forward pass is pretty straightforward; this example uses a softmax and cross-entropy loss. Note that each iteration uses the same W* and h* arrays, but the output and hidden state are different:
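A sketch of that forward loop in the same spirit (inputs and targets are assumed to be lists of integer character indices for the current chunk, and hprev the hidden state carried over from the previous chunk):

```python
xs, hs, ys, ps, loss = {}, {}, {}, {}, 0
hs[-1] = np.copy(hprev)  # hidden state carried in from the previous chunk
for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))   # one-hot encoding of the input char
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # h[t]
    ys[t] = np.dot(Why, hs[t]) + by     # unnormalized log-probabilities y[t]
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # softmax probabilities
    loss += -np.log(ps[t][targets[t], 0])          # cross-entropy loss
```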
Now, the backward pass is performed exactly as if it were a feed-forward network, but the gradient of the W* and h* arrays accumulates the gradients from all cells:
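A matching sketch of the backward loop; notice how dWxh, dWhh and dWhy accumulate with += over all time steps, while dhnext carries the error back to the previous cell:

```python
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1                  # gradient of softmax + cross-entropy
    dWhy += np.dot(dy, hs[t].T)          # shared weights accumulate gradients
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext      # backprop into h: this cell + next cell
    dhraw = (1 - hs[t] * hs[t]) * dh     # backprop through tanh
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)        # gradient passed to the previous cell
```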
Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients for each cell.

What's different in LSTMs
The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, here is the backward pass:
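Since the implementation being referred to is not reproduced here, below is a minimal sketch of an LSTM backward pass over a whole sequence. It assumes batch-first arrays of shape (N, T, H), gates stacked in the order [i, f, o, g], zero initial states, and a cache of values saved during the forward pass; the names (lstm_backward, d_h, d_h_next_t, etc.) are illustrative, not from any particular library:

```python
import numpy as np

def lstm_backward(d_h, cache):
    """Backward pass through an unrolled LSTM.

    d_h: upstream gradient of every hidden state, shape (N, T, H)
    cache: (x, h, c, i, f, o, g, Wx, Wh) saved on the forward pass, where
           x is (N, T, D), h/c/i/f/o/g are (N, T, H), Wx is (D, 4H), Wh is (H, 4H).
    """
    x, h, c, i, f, o, g, Wx, Wh = cache
    N, T, H = d_h.shape
    D = x.shape[2]

    d_x = np.zeros((N, T, D))
    d_Wx, d_Wh = np.zeros_like(Wx), np.zeros_like(Wh)
    d_b = np.zeros(4 * H)
    d_h_next_t = np.zeros((N, H))   # gradient arriving from the cell at t+1
    d_c_next_t = np.zeros((N, H))

    for t in reversed(range(T)):
        # error signal of this cell: loss gradient at step t plus the
        # gradient flowing back from the next cell
        d_next_h = d_h_next_t + d_h[:, t, :]

        prev_c = c[:, t-1, :] if t > 0 else np.zeros((N, H))
        prev_h = h[:, t-1, :] if t > 0 else np.zeros((N, H))
        tanh_c = np.tanh(c[:, t, :])

        # h = o * tanh(c)
        d_o = d_next_h * tanh_c
        d_c = d_c_next_t + d_next_h * o[:, t, :] * (1 - tanh_c ** 2)

        # c = f * prev_c + i * g
        d_f = d_c * prev_c
        d_i = d_c * g[:, t, :]
        d_g = d_c * i[:, t, :]
        d_c_next_t = d_c * f[:, t, :]          # cell-state gradient to the previous cell

        # through the gate non-linearities: sigmoid for i, f, o and tanh for g
        d_a = np.hstack([d_i * i[:, t, :] * (1 - i[:, t, :]),
                         d_f * f[:, t, :] * (1 - f[:, t, :]),
                         d_o * o[:, t, :] * (1 - o[:, t, :]),
                         d_g * (1 - g[:, t, :] ** 2)])

        # a = x @ Wx + prev_h @ Wh + b: the shared parameters accumulate gradients
        d_x[:, t, :] = d_a @ Wx.T
        d_Wx += x[:, t, :].T @ d_a
        d_Wh += prev_h.T @ d_a
        d_b += d_a.sum(axis=0)
        d_h_next_t = d_a @ Wh.T                # hidden-state gradient to the previous cell

    return d_x, d_Wx, d_Wh, d_b
```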
Summary
Now, back to your questions.
There are shared weights in different layers, and a few more additional variables (states) that you need to pay attention to. Other than this, no difference at all.
First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so the initial error signal gets its gradient:
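For the softmax plus cross-entropy combination, this gradient is just the predicted probabilities with 1 subtracted at the target index, as in the backward loop above:

```python
dy = np.copy(ps[t])
dy[targets[t]] -= 1   # gradient of cross-entropy loss w.r.t. the softmax input
```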
Note that it's the same error signal as in an ordinary feed-forward neural network. If you use the L2 loss, the signal indeed equals the ground truth minus the actual output.
In the case of the LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you'll see a direct correspondence with the network wiring.