What is the right way to scale data for TensorFlow?


For input to neural nets, data has to be scaled to the [0, 1] range. For this, I often see the following kind of code in blogs:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

x_train, x_test, y_train, y_test = train_test_split(x, y)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

The problem here is that the min/max of the test set may lie outside the min/max range of the training set. If that happens, the normalized values in x_test will be greater than 1.0, or negative. For example:

from sklearn.preprocessing import MinMaxScaler

train_data = [[0,3],[0,7],[0,9],[0,16],[0,10]]
test_data = [[1,1],[1,25],[1,6]]
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_data)
test_scaled = scaler.transform(test_data)
print(test_scaled)

[[ 1.         -0.15384615]
 [ 1.          1.69230769]
 [ 1.          0.23076923]]

A trivial solution is to scale before splitting, but that only avoids the problem in toy examples. As a real-life case, consider anomaly detection, where the training set typically consists of fairly normal data; the anomalous situations may well contain values outside the range of anything the network saw during training.

In such situations, is it OK to feed values greater than 1.0 or less than 0.0 to a neural network? If not, what is the recommended way to normalize the data?

(One possible solution is to define an upper bound for the values, e.g. 120% of the maximum seen during training, and saturate any value above it to that bound. But is there a predefined scaling function that applies this kind of cutoff before scaling?)
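For reference, one way to get this saturation behavior is a sketch along the following lines. It assumes scikit-learn >= 0.24, where MinMaxScaler gained a clip parameter; the manual np.clip variant works on any version. It clips to the training range itself rather than the hypothetical 120% bound mentioned above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_data = [[0, 3], [0, 7], [0, 9], [0, 16], [0, 10]]
test_data = [[1, 1], [1, 25], [1, 6]]

# clip=True saturates transformed values to feature_range (here [0, 1]),
# so test values below the training min become 0.0 and values above
# the training max become 1.0.
scaler = MinMaxScaler(clip=True)
scaler.fit(train_data)
test_scaled = scaler.transform(test_data)
print(test_scaled)

# Equivalent manual clipping, for older scikit-learn versions:
plain = MinMaxScaler().fit(train_data)
test_clipped = np.clip(plain.transform(test_data), 0.0, 1.0)
```

With the earlier example data, the 25 in the second column (above the training max of 16) now maps to exactly 1.0 instead of 1.69.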

1 Answer

fbrand:

I understand what you are saying, but I think it is because your train and test sets do not come from the same dataset, and thus do not share the same ranges. The x_test and x_train sets should be representative of each other. If you create a large random dataset and then split it, you will find that MinMaxScaler() keeps the transformed values within the specified ranges.
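That claim is easy to check with synthetic data. A sketch (the uniform distribution, sample size, and seed are my own choices):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 100.0, size=(10_000, 2))

x_train, x_test = train_test_split(x, test_size=0.25, random_state=0)
scaler = MinMaxScaler()
x_train_s = scaler.fit_transform(x_train)
x_test_s = scaler.transform(x_test)

# With a large i.i.d. sample, the training min/max are very close to the
# population min/max, so test values stray outside [0, 1] only marginally.
print(x_test_s.min(), x_test_s.max())
```

The caveat is exactly the asker's point: this only holds when train and test are drawn from the same distribution, which anomaly detection deliberately violates.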

As a side note: I personally don't agree with scaling before splitting, as it creates data leakage.