Do I have to use a Scale-Layer after every BatchNorm Layer?

I am using Caffe (specifically pycaffe) to create my neural network, and I noticed that I have to use a BatchNorm layer to get a positive result; I am using the Kappa score as my evaluation metric. I have seen several different placements of the BatchNorm layer within networks. I also came across the Scale layer, which is not in the Layer Catalogue but is often mentioned together with the BatchNorm layer.

Do you always need to put a Scale layer after a BatchNorm layer, and what does it do?


There are 2 answers below.

BEST ANSWER

From the original batch normalization paper by Ioffe & Szegedy: "we make sure that the transformation inserted in the network can represent the identity transform." Without a Scale layer after the BatchNorm layer, that would not be the case, because Caffe's BatchNorm layer only normalizes its input: it has no learnable parameters of its own. The Scale layer, configured with bias_term: true, supplies the learnable scale (gamma) and shift (beta) that restore this property.

I learned this from the Deep Residual Networks git repo; see item 6 under disclaimers and known issues there.
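
For concreteness, here is a minimal pycaffe (NetSpec) sketch of the usual pairing; the layer names, data source, and hyperparameters are illustrative assumptions, not taken from either post:

    import caffe
    from caffe import layers as L, params as P

    n = caffe.NetSpec()
    # Illustrative data layer; 'train_lmdb' is a placeholder path.
    n.data, n.label = L.Data(source='train_lmdb', backend=P.Data.LMDB,
                             batch_size=64, ntop=2)
    n.conv1 = L.Convolution(n.data, num_output=32, kernel_size=3,
                            weight_filler=dict(type='xavier'))
    # Caffe's BatchNorm only normalizes; it learns nothing by itself.
    n.bn1 = L.BatchNorm(n.conv1, in_place=True)
    # Scale with bias_term=True adds the learnable gamma (multiplier)
    # and beta (bias), so the pair can represent the identity transform.
    n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)
    n.relu1 = L.ReLU(n.scale1, in_place=True)

    print(n.to_proto())  # emit the prototxt for these layers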

SECOND ANSWER

In general, you will get no benefit from a Scale layer juxtaposed with batch normalization. Each is an affine transformation of the data: BatchNorm shifts and rescales it so that the new distribution has a mean of 0 and variance of 1, while Scale maps the entire range onto a specified interval, typically [0, 1]. Because both are affine, applying them in sequence means the second completely overrides the first: the end result is the same as applying the second transformation alone.
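
As a quick numerical illustration (my own NumPy sketch, not Caffe layers), standardizing and then min-max scaling produces exactly the same output as min-max scaling alone:

    import numpy as np

    x = np.random.randn(100) * 7 + 3  # arbitrary data

    def standardize(v):
        # BatchNorm-style: zero mean, unit variance
        return (v - v.mean()) / v.std()

    def minmax(v, lo=-1.0, hi=1.0):
        # Scale-style (as described above): map the range onto [lo, hi]
        return lo + (hi - lo) * (v - v.min()) / (v.max() - v.min())

    print(np.allclose(minmax(standardize(x)), minmax(x)))  # True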

They also deal somewhat differently with outliers. Consider a set of data: ten values, five each of -1 and +1. BatchNorm will not change this at all: it already has mean 0 and variance 1. For consistency, let's specify the same interval for Scale, [-1, 1], which is also a popular choice.

Now add an outlier of, say, 99 to the mix. Scale will transform the set to the range [-1, 1], so that there are now five -1.00 values, one +1.00 value (the former 99), and five values of -0.96 (formerly +1).

BatchNorm works from the mean and standard deviation, not the max and min values. The new mean is +9; the standard deviation is 28.48 (rounding everything to 2 decimal places). The values will be mapped to roughly five each of -0.35 (formerly -1) and -0.28 (formerly +1), and one value of 3.16 (formerly 99).
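
These figures are easy to check with a short, self-contained NumPy sketch:

    import numpy as np

    x = np.array([-1.0] * 5 + [1.0] * 5 + [99.0])

    print(x.mean())                  # 9.0
    print(round(x.std(), 2))         # 28.48
    print((x - x.mean()) / x.std())  # five ~-0.35, five ~-0.28, one ~3.16

    # Min-max scaling onto [-1, 1]:
    print(-1 + 2 * (x - x.min()) / (x.max() - x.min()))
    # five -1.0, five -0.96, one 1.0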

Whether one scaling works better than the other depends largely on the skew and scatter of your distribution. I prefer BatchNorm, as it tends to differentiate better in the dense regions of a distribution.