I am currently writing a back-propagation script, and I am unsure how to go about updating my weight values. Here is an image just to make things simple.
My question: How is the error calculated and applied?
I do know that k1 and k2 produce individual error values (target - output). I do not, however, know whether these are to be used directly.
Am I supposed to use the mean value of both error values and then apply that single error value to all of the weights?
Or am I supposed to:
update weights Wk1j1 and Wk1j2 with the error value of k1
update weights Wk2j1 and Wk2j2 with the error value of k2
update weights Wj1i1 and Wj1i2 with the error value of j1
update weights Wj2i1 and Wj2i2 with the error value of j2
Before you start shooting: I understand that I must use the sigmoid function etc. THIS IS NOT THE QUESTION. Every explanation states that I have to calculate the error value for the outputs (this is where I am confused)
and then get the net error value by:
((error_k1^2) + (error_k2^2) + (error_j1^2) + (error_j2^2)) / 2
From the wiki:
As the image states, this is true for each of the output nodes, in my example k1 and k2.
The two rows under the image are delta Wh and delta Wi. Which error value am I supposed to use? (This is basically my question: which error value do I calculate the new weights with?)
Answer:
http://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf page 3 (noted as 18), #4
Back-propagation does not use the error values directly. What you back-propagate is the partial derivative of the error with respect to each element of the neural network. Eventually that gives you dE/dW for each weight, and you take a small step against that gradient.
To do this, you need to know:
The activation value of each neuron (kept from when doing the feed-forward calculation)
The mathematical form of the error function (e.g. it may be a sum of squares difference). Your first set of derivatives will be dE/da for the output layer (where E is your error and a is the output of the neuron).
The mathematical form of the neuron activation or transfer function. This is where you discover why we use the sigmoid: its derivative dy/dx can conveniently be expressed in terms of the activation value, dy/dx = y * (1 - y). This is fast to compute and also means you don't have to store or re-calculate the weighted sum.
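To see that this identity really holds, here is a small sketch (function names are my own) that compares the y * (1 - y) form against a numerical central-difference derivative of the sigmoid:

```python
# Numerically check that d(sigmoid)/dx == y * (1 - y), where y = sigmoid(x).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.5
y = sigmoid(x)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = y * (1 - y)                                  # uses only the activation
# numeric and analytic agree to roughly 1e-10
```

Note that `analytic` is computed from the activation `y` alone, which is exactly why you don't need to keep the weighted sum around.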
Please note, I am going to use different notation from you, because your labels make it hard to express the general form of back-propagation.
In my notation:
Superscripts in brackets (k) or (k+1) identify a layer in the network.
There are N neurons in layer (k), indexed with subscript i
There are M neurons in layer (k+1), indexed with subscript j
The sum of inputs to a neuron is z
The output of a neuron is a
A weight is Wij and connects ai in layer (k) to zj in layer (k+1). Note W0j is the weight for bias term, and sometimes you need to include that, although your diagram does not show bias inputs or weights.
With the above notation, the general form of the back-propagation algorithm is a five-step process:
1) Calculate initial dE/da for each neuron in the output layer. Where E is your error value, and a is the activation of the neuron. This depends entirely on your error function.
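The original answer showed this as an image; reconstructed from the surrounding text, for a sum-of-squares error E = ½ Σ (a_j − t_j)² with targets t_j, the initial derivative at each output neuron is simply:

```latex
\frac{\partial E}{\partial a_j} = a_j - t_j
```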
Then, for each layer (start with k = maximum, your output layer)
2) Backpropagate dE/da to dE/dz for each neuron (where a is your neuron output and z is the sum of all inputs to it including the bias) within a layer. In addition to needing to know the value from (1) above, this uses the derivative of your transfer function:
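Reconstructing the equation that was an image here: by the chain rule, and using the sigmoid identity dy/dx = y(1 − y) from above, this step is

```latex
\frac{\partial E}{\partial z_j}
  = \frac{\partial E}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j}
  = \frac{\partial E}{\partial a_j} \, a_j \, (1 - a_j)
```

(the last equality holds for the sigmoid specifically; a different transfer function gives a different second factor).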
(Now reduce k by 1 for consistency with the remainder of the loop):
3) Backpropagate dE/dz from an upper layer to dE/da for all outputs in previous layer. This basically involves summing across all weights connecting that output neuron to the inputs in the upper layer. You don't need to do this for the input layer. Note how it uses the value you calculated in (2)
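In the notation above (the original equation was an image), with M neurons in layer (k+1) indexed by j, this sum is

```latex
\frac{\partial E}{\partial a_i^{(k)}}
  = \sum_{j=1}^{M} W_{ij} \, \frac{\partial E}{\partial z_j^{(k+1)}}
```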
4) (Independently of (3)) Backpropagate dE/dz from an upper layer to dE/dW for all weights connecting that layer to the previous layer (this includes the bias term):
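Reconstructed in the same notation, each weight's derivative is just the product of the activation feeding into it and the dE/dz at the neuron it feeds:

```latex
\frac{\partial E}{\partial W_{ij}}
  = a_i^{(k)} \, \frac{\partial E}{\partial z_j^{(k+1)}}
```

where a_0 = 1 for the bias term W_0j.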
Simply repeat 2 to 4 until you have dE/dW for all your weights. For more advanced networks (e.g. recurrent), you can add in other error sources by re-doing step 1.
5) Now you have the weight derivatives, you can simply subtract them (times a learning rate) to take a step towards what you hope is the error function minimum:
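That is, with a learning rate η, the update (shown as an image in the original) is

```latex
W_{ij} \leftarrow W_{ij} - \eta \, \frac{\partial E}{\partial W_{ij}}
```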
The maths notation can seem a bit dense in places the first time you see this. But if you look a few times, you will see there are essentially only a few variables, indexed by some combination of i, j and k. In addition, with Matlab you can express the vectors and matrices really easily. So, for instance, this is what the whole process might look like for learning a single training example:
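As a sketch of those five steps (in NumPy rather than the Matlab the answer refers to; the 2-2-2 layer sizes match the question's diagram, and the input, target, learning rate and all names are my own assumptions):

```python
# One backprop step for a single training example on a 2-2-2 sigmoid network
# with sum-of-squares error. Layer sizes, values and names are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each weight matrix includes a bias row (W[0, :]); layer inputs are
# augmented with a leading 1 so the bias is handled like any other weight.
W1 = rng.normal(size=(3, 2))   # input (2 + bias) -> hidden (2)
W2 = rng.normal(size=(3, 2))   # hidden (2 + bias) -> output (2)

x = np.array([0.05, 0.10])     # example input
t = np.array([0.01, 0.99])     # example target
eta = 0.5                      # learning rate

# Feed-forward, keeping each activation for the backward pass
a0 = np.concatenate(([1.0], x))      # bias + inputs
a1 = sigmoid(a0 @ W1)                # hidden activations
a1b = np.concatenate(([1.0], a1))    # bias + hidden activations
a2 = sigmoid(a1b @ W2)               # output activations

# 1) dE/da at the output, for E = 0.5 * sum((a2 - t)^2)
dE_da2 = a2 - t
# 2) dE/dz through the sigmoid: dy/dx = y * (1 - y)
dE_dz2 = dE_da2 * a2 * (1 - a2)
# 4) dE/dW for the output weights: outer product with that layer's inputs
dE_dW2 = np.outer(a1b, dE_dz2)
# 3) back-propagate to the hidden activations (skip the bias row of W2)
dE_da1 = W2[1:, :] @ dE_dz2
# 2) again, one layer down
dE_dz1 = dE_da1 * a1 * (1 - a1)
# 4) again, for the input-to-hidden weights
dE_dW1 = np.outer(a0, dE_dz1)

# 5) gradient-descent step
W2 -= eta * dE_dW2
W1 -= eta * dE_dW1
```

Each numbered comment corresponds to the step of the same number above; step (3) skips the bias row because the bias has no incoming activation to propagate to.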
As written this is stochastic gradient descent (weights altering once per training example), and obviously is only learning one training example.
Apologies for the pseudo-math notation in places. Stack Overflow doesn't have simple built-in LaTeX-like maths, unlike Math Overflow. I have skipped some of the derivation/explanation for steps 3 and 4 to avoid this answer taking forever.