MSE Cost Function for Training a Neural Network

In an online textbook on neural networks and deep learning, the author illustrates neural net basics by minimizing a quadratic cost function, which he says is synonymous with mean squared error. Two things about his function have me confused, though (reproduced below).

$$\mathrm{MSE} \equiv \frac{1}{2n}\sum \lVert y_{\text{true}} - y_{\text{pred}} \rVert^{2}$$

  1. Why is the sum of squared errors divided by 2n rather than by the number of training examples n? How is this the mean of anything?
  2. Why is double-bar notation used instead of parentheses? It had me thinking there was some other calculation going on, such as an L2 norm, that is not shown explicitly. I suspect this is not the case and that the term is just meant to express the plain old sum of squared errors, but it is confusing. (My reading of the formula in code follows this list.)
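
For concreteness, here is how I currently read the formula in NumPy. This is my own sketch, not the author's code; the names y_true and y_pred are mine, and I am treating each row as one training example:

    import numpy as np

    def quadratic_cost(y_true, y_pred):
        # y_true, y_pred: arrays of shape (n, output_dim), one row per training example
        n = y_true.shape[0]
        # squared Euclidean norm of each example's error, summed over all examples
        sum_sq = np.sum((y_true - y_pred) ** 2)
        return sum_sq / (2 * n)  # the book divides by 2n rather than n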

Any insight you can offer is greatly appreciated!

There are 3 answers below.

The double bar denotes a distance measure (the Euclidean norm); plain parentheses would be incorrect if y is multi-dimensional. The strict definition of mean squared error has no 2 next to the n, but the factor is unimportant: it gets absorbed into the learning rate. It is often included so that it cancels the factor of 2 produced by the square when evaluating the derivative.
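
To make the cancellation explicit, here is the single-example term differentiated with respect to the prediction, written with the question's y_true/y_pred names:

$$\frac{\partial}{\partial y_{\text{pred}}}\left[\frac{1}{2}\,\lVert y_{\text{true}} - y_{\text{pred}} \rVert^{2}\right] = -\,(y_{\text{true}} - y_{\text{pred}})$$

Without the 1/2, the right-hand side would carry an extra factor of 2.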

The notation ∥v∥ just denotes the usual length function for a vector v; that sentence is a quote from the online textbook you referenced.

You can find more information on the double bars by looking up vector norms, but from what I understand you can basically view it as the vector analogue of an absolute value.
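
Written out with the question's notation, for a vector v with components v_1, ..., v_k:

$$\lVert v \rVert = \sqrt{v_1^{2} + \cdots + v_k^{2}}, \qquad \lVert y_{\text{true}} - y_{\text{pred}} \rVert^{2} = \sum_{j}\left(y_{\text{true},j} - y_{\text{pred},j}\right)^{2}$$

so squaring the norm just gives the ordinary sum of squared componentwise errors.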

I'm not sure why it says 2n, but it's not always 2n. Wikipedia, for example, writes the function as follows:

$$\operatorname{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^{2}$$

Googling "mean squared error" also turns up a lot of sources that use the Wikipedia version rather than the one from the online textbook.

The 0.5 factor by which the cost function is multiplied is not important. In fact, you could multiply the cost by any positive constant and the learning would be the same, up to a rescaling of the learning rate. The 0.5 is only there so that the derivative of the cost function with respect to the output is simply $$y - y_{t}$$, which is convenient in some applications, like backpropagation.
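
As a quick numerical check of that claim, here is a rough sketch with made-up numbers, computing the gradient of c * ||y_true - y_pred||^2 for a hypothetical scale factor c:

    import numpy as np

    y_true = np.array([1.0, 0.0, 0.0])
    y_pred = np.array([0.8, 0.1, 0.1])

    def grad_of_scaled_cost(c):
        # gradient of c * ||y_true - y_pred||^2 with respect to y_pred
        return -2.0 * c * (y_true - y_pred)

    print(grad_of_scaled_cost(0.5))  # [-0.2  0.1  0.1], i.e. exactly y_pred - y_true
    print(grad_of_scaled_cost(1.0))  # [-0.4  0.2  0.2], twice as large; halving the learning rate gives the same update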