Back propagation algorithm: error computation


I am currently writing a back propagation script. I am unsure how to go about updating my weight values. Here is an image just to make things simple.

[image: a small feed-forward network with inputs i1 and i2, hidden neurons j1 and j2, and output neurons k1 and k2, with weights such as Wk1j1 labelled on the connections]

My question: How is the error calculated and applied?

I know that k1 and k2 each produce an individual error value (target - output). I do not, however, know whether these are to be used directly.

Am I supposed to use the mean value of both error values and then apply that single error value to all of the weights?

Or am I supposed to:

update weight Wk1j1 and Wk1j2 with the error value of k1
update weight Wk2j1 and Wk2j2 with the error value of k2
update weight Wj1i1 and Wj1i2 with the error value of j1
update weight Wj2i1 and Wj2i2 with the error value of j2

Before you start shooting, I understand that I must use the sigmoid function etc. THIS IS NOT THE QUESTION. Every explanation I read states that I have to calculate the error value for the outputs, and this is where I am confused,

and then get the net error value by:

((error_k1^2) + (error_k2^2) + (error_j1^2) + (error_j2^2)) / 2

From Wiki:

[equation image from the Wikipedia backpropagation article: the squared-error term for each output node, followed by the delta Wh and delta Wi update rules]

As the image states, this applies to each of the output nodes, in my example k1 and k2. The image is from the Wikipedia article on backpropagation.

The two rows under the image are delta Wh and delta Wi. Which error value am I supposed to use? (This is basically my question: which error value am I supposed to calculate the new weights with?)

Answer:

http://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf page 3 (numbered as 18), item #4

Answer 1 (score 7):

Back-propagation does not use the error values directly. What you back-propagate is the partial derivative of the error with respect to each element of the neural network. Eventually that gives you dE/dW for each weight, and you take a small step against that gradient, i.e. in the direction that reduces the error.

To do this, you need to know:

  • The activation value of each neuron (kept from when doing the feed-forward calculation)

  • The mathematical form of the error function (e.g. it may be a sum of squares difference). Your first set of derivatives will be dE/da for the output layer (where E is your error and a is the output of the neuron).

  • The mathematical form of the neuron activation or transfer function. This is where you discover why we use the sigmoid: its derivative dy/dx can conveniently be expressed in terms of the activation value, dy/dx = y * (1 - y). That is fast to compute, and it also means you don't have to store or re-calculate the weighted sum.
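That identity is easy to verify numerically. The quick Python check below (a throwaway sketch, separate from the Octave code later in this answer) compares y * (1 - y) against a finite-difference estimate of the sigmoid's slope:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.3            # an arbitrary point to test at
y = sigmoid(x)

# Analytic derivative, expressed purely in terms of the activation y
analytic = y * (1.0 - y)

# Numerical derivative via a central difference
h = 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(abs(analytic - numerical) < 1e-8)  # the two agree
```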

Please note, I am going to use different notation from you, because your labels make it hard to express the general form of back-propagation.

In my notation:

  • Superscripts in brackets (k) or (k+1) identify a layer in the network.

  • There are N neurons in layer (k), indexed with subscript i

  • There are M neurons in layer (k+1), indexed with subscript j

  • The sum of inputs to a neuron is z

  • The output of a neuron is a

  • A weight is Wij and connects ai in layer (k) to zj in layer (k+1). Note that W0j is the weight for the bias term; sometimes you need to include it, although your diagram does not show bias inputs or weights.

With the above notation, the general form of the back-propagation algorithm is a five-step process:

1) Calculate the initial dE/da for each neuron in the output layer, where E is your error value and a is the activation of the neuron. This depends entirely on your error function.
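For example, if E is the sum-of-squares error E = (1/2) * sum_j (a_j - t_j)^2 over targets t_j, then step 1 reduces to dE/da_j = a_j - t_j. A tiny Python illustration with made-up activations and targets:

```python
# Hypothetical output activations and training targets for a 2-output net
a = [0.73, 0.41]   # output layer activations
t = [0.10, 0.90]   # training targets

# Sum-of-squares error: E = 1/2 * sum((a_j - t_j)^2)
E = sum((aj - tj) ** 2 for aj, tj in zip(a, t)) / 2

# Step 1: dE/da_j = a_j - t_j for each output neuron
dE_da = [aj - tj for aj, tj in zip(a, t)]

print(E)      # roughly 0.3185
print(dE_da)  # [0.63, -0.49] up to float rounding
```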

Then, for each layer (start with k = maximum, your output layer)

2) Backpropagate dE/da to dE/dz for each neuron within a layer (where a is the neuron's output and z is the sum of all inputs to it, including the bias). In addition to the value from (1) above, this uses the derivative of your transfer function:

    dE/dz_j^(k) = dE/da_j^(k) * f'(z_j^(k))        (for the sigmoid, f'(z_j) = a_j * (1 - a_j))

(Now reduce k by 1 for consistency with the remainder of the loop):

3) Backpropagate dE/dz from an upper layer to dE/da for all outputs in the previous layer. This basically involves summing across all weights connecting each such output neuron to the inputs in the upper layer. You don't need to do this for the input layer. Note how it uses the value you calculated in (2):

    dE/da_i^(k) = sum over j of ( W_ij * dE/dz_j^(k+1) )

4) (Independently of (3)) Backpropagate dE/dz from an upper layer to dE/dW for all weights connecting that layer to the previous layer (this includes the bias term):

    dE/dW_ij = a_i^(k) * dE/dz_j^(k+1)        (where a_0^(k) = 1 is the bias input)

Simply repeat 2 to 4 until you have dE/dW for all your weights. For more advanced networks (e.g. recurrent), you can add in other error sources by re-doing step 1.
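A good way to convince yourself that steps 2 to 4 produce the right dE/dW is a numerical gradient check: nudge one weight, re-run the feed-forward pass, and compare the change in E with the backpropagated derivative. The Python sketch below does this for a single sigmoid neuron with one invented weight and input (deliberately simpler than the network in the question):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setup: one sigmoid neuron, one input, sum-of-squares error
x, w, t = 0.5, 0.8, 0.2   # input, weight, target (made-up values)

def error(w):
    a = sigmoid(w * x)
    return 0.5 * (a - t) ** 2

# Backprop chain dE/da -> dE/dz -> dE/dw (steps 1, 2 and 4)
a = sigmoid(w * x)
dE_da = a - t                      # step 1
dE_dz = dE_da * a * (1.0 - a)      # step 2 (sigmoid derivative)
dE_dw = dE_dz * x                  # step 4

# Numerical check via a central difference on the error function
h = 1e-6
numeric = (error(w + h) - error(w - h)) / (2 * h)

print(abs(dE_dw - numeric) < 1e-8)  # backprop matches the numerical gradient
```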

5) Now you have the weight derivatives, you can simply subtract them (times a learning rate) to take a step towards what you hope is the error function minimum:

    W_ij := W_ij - learn_rate * dE/dW_ij


The maths notation can seem a bit dense in places the first time you see it. But if you look a few times, you will see there are essentially only a few variables, indexed by some combination of i, j, k values. In addition, with Octave (or Matlab), you can express vectors and matrices really easily. So, for instance, this is what the whole process might look like for learning a single training example:

clear ; close all; clc; more off

InputVector          = [ 0.5, 0.2 ];
TrainingOutputVector = [ 0.1, 0.9 ];

learn_rate = 1.0;
W_InputToHidden  = randn( 3, 2 ) * 0.6;
W_HiddenToOutput = randn( 3, 2 ) * 0.6;

for i=1:20,
    % Feed-forward input to hidden layer
    InputsPlusBias = [1, InputVector];
    HiddenActivations = 1.0 ./ (1.0 + exp(-InputsPlusBias * W_InputToHidden));

    % Feed-forward hidden layer to output layer
    HiddenPlusBias = [ 1, HiddenActivations ];
    OutputActivations = 1.0 ./ (1.0 + exp(-HiddenPlusBias * W_HiddenToOutput));

    % Backprop step 1: dE/da for output layer (assumes mean square error)
    OutputErrors = OutputActivations - TrainingOutputVector;

    nn_error = sum( OutputErrors .* OutputErrors ) / 2;
    fprintf( 'Epoch %d, error %f\n', i, nn_error);

    % Step 2: dE/da to dE/dz on the output layer (sigmoid derivative)
    OutputActivationDeltas = OutputErrors .* OutputActivations .* (1 - OutputActivations);

    % Steps 3 & 2 combined:
    % Back propagate dE/dz on the output layer to dE/dz on the hidden layer
    % (uses the sigmoid derivative again)
    HiddenActivationDeltas = ( OutputActivationDeltas * W_HiddenToOutput(2:end,:)' ) ...
      .* ( HiddenActivations .* (1 - HiddenActivations) );

    % Step 4 (twice): back propagate dE/dz to dE/dW
    W_InputToHidden_Deltas  = InputsPlusBias' * HiddenActivationDeltas;
    W_HiddenToOutput_Deltas = HiddenPlusBias' * OutputActivationDeltas;

    % Step 5: Alter the weights
    W_InputToHidden  -= learn_rate * W_InputToHidden_Deltas;
    W_HiddenToOutput -= learn_rate * W_HiddenToOutput_Deltas;
end;

As written, this is stochastic gradient descent (the weights change once per training example), and obviously it is only learning a single training example.
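If you would rather do batch gradient descent, the same vectorised style works: feed the whole training set through at once, and the matrix products sum the gradients over the batch. Here is a hedged Python/NumPy translation of the loop above, extended to a made-up batch of two examples (the weight shapes mirror the Octave code; the random initial values differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up batch of two training examples (the original loop used one)
X = np.array([[0.5, 0.2],
              [0.1, 0.7]])
T = np.array([[0.1, 0.9],
              [0.8, 0.3]])

learn_rate = 1.0
W_ih = rng.standard_normal((3, 2)) * 0.6   # input (+ bias) -> hidden
W_ho = rng.standard_normal((3, 2)) * 0.6   # hidden (+ bias) -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

errors = []
for epoch in range(20):
    # Feed-forward over the whole batch at once
    Xb = np.hstack([np.ones((len(X), 1)), X])
    H = sigmoid(Xb @ W_ih)
    Hb = np.hstack([np.ones((len(H), 1)), H])
    O = sigmoid(Hb @ W_ho)

    dE_da = O - T                              # step 1 (mean square error)
    errors.append(0.5 * np.sum(dE_da ** 2) / len(X))

    delta_o = dE_da * O * (1.0 - O)            # step 2 on the output layer
    delta_h = (delta_o @ W_ho[1:, :].T) * H * (1.0 - H)   # steps 3 & 2

    # Step 4, then step 5: the matrix products sum gradients over the
    # batch, so divide by the batch size before taking the step
    W_ho -= learn_rate * (Hb.T @ delta_o) / len(X)
    W_ih -= learn_rate * (Xb.T @ delta_h) / len(X)

print(errors[0], errors[-1])
```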


Apologies for the pseudo-math notation in places. Stack Overflow doesn't have simple built-in LaTeX-like maths, unlike Math Overflow. I have skipped some of the derivation/explanation for steps 3 and 4 to avoid this answer taking forever.

Answer 2 (score 0):

Coming to understand how a neural network (a multilayer perceptron, specifically) does its magic can be deeply satisfying to a student of machine learning, especially if you can actually make one work and solve simple classification problems. Given your strong interest, you will definitely succeed.

Once you appreciate that, in order to train a neural network, you need to somehow calculate the partial derivatives of the error with respect to the weights, backpropagation can be derived easily and qualitatively by reducing it to three core concepts: (1) Boxing, (2) Sensitivity and (3) Weight Updates. You already have some idea that (3) is necessary.

I agree that irritating and dense mathematical formulas ("What does this symbol mean again?") keep most students from enjoying this journey. In fact, Bernard Widrow, one of the pioneers in this area, said so himself.

In a whitepaper that I authored earlier this year (no numbered equations!), I have tried my best to devise an intuitive notation that makes it easy to connect to the concept being symbolized. Something along the lines of calling the input I, the output O, the target T etc.

Reading this might help you with your question (and more): A Gentle Introduction to Backpropagation. This article contains pseudocode ("Training Wheels for Training Neural Networks") for implementing the algorithm.

The pseudocode will guide you through the following easily understood steps:

  1. Initialize the weights and set the learning rate and the stopping criteria.
  2. Randomly choose an input and the corresponding target.
  3. Compute the input to each layer and the output of the final layer.
  4. Compute the sensitivity components.
  5. Compute the gradient components and update the weights.
  6. Check against the stopping criteria. Exit and return the weights or loop back to Step 2.
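For a feel of how those six steps fit together in code, here is a minimal Python sketch of the same loop structure. To keep it short it trains a single sigmoid neuron on a made-up, linearly separable toy set rather than a full multilayer perceptron, so the sensitivity in step 4 is just one term:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set with OR-like targets (invented for illustration)
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

random.seed(0)

# Step 1: initialise weights, the learning rate and the stopping criteria
w = [random.uniform(-0.5, 0.5) for _ in range(3)]   # bias + 2 weights
learn_rate, max_steps, target_error = 0.5, 10000, 0.05

for step in range(max_steps):
    # Step 2: randomly choose an input and the corresponding target
    x, t = random.choice(data)
    xb = [1.0] + x                        # prepend the bias input

    # Step 3: compute the input to the layer and the final output
    z = sum(wi * xi for wi, xi in zip(w, xb))
    o = sigmoid(z)

    # Step 4: the sensitivity component (dE/dz for squared error + sigmoid)
    sensitivity = (o - t) * o * (1.0 - o)

    # Step 5: gradient components and the weight update
    w = [wi - learn_rate * sensitivity * xi for wi, xi in zip(w, xb)]

    # Step 6: check against the stopping criterion, exit or loop to step 2
    total = sum(0.5 * (sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + xx))) - tt) ** 2
                for xx, tt in data)
    if total < target_error:
        break
```

After training, thresholding the neuron's output at 0.5 classifies all four toy examples correctly.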