I am evaluating an SOM/Kohonen map as a regressor on a dataset. Unfortunately it performs extremely badly, so badly that I suspect there is an error in my code. While the R² score on the training set is usually only around 1-5 %, the R² score on the test set is ALWAYS extremely negative; for example:
Train: 1.09 %
Test: -5668908.61 %
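(For context on the metric: an R² of 0 corresponds to always predicting the mean of y, so values this negative mean the predictions are off by orders of magnitude, e.g. as if a scaling step were skipped or applied twice. A quick illustration with made-up numbers, not from my data:)

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of y_true gives R^2 = 0
print(r2_score(y_true, np.full(4, y_true.mean())))  # -> 0.0

# Predictions on a wildly wrong scale push R^2 far below zero
print(r2_score(y_true, y_true * 1000))
```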
Even though I have gone over my code again and again, I want to make sure that I did not make a mistake with scaling the data or something similar that might cause the bad performance. Basically, I split the data into X and y and then use sklearn's train_test_split() to get the respective datasets.
I use sklearn's MinMaxScaler() to fit_transform() X_train and apply the same fitted transformation to X_test, so that there is no data leakage. For y_train I use a separate scaler (scalery).
After each model is trained, I use the y_train scaler (scalery) to invert the scaling on y_pred, y_pred_train and y_train.
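To make the intended pattern concrete, here is a minimal standalone sketch of the scaling steps with random placeholder data (not my actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=(100, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Fit the feature scaler on the training data only, then reuse it for the test data
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Separate scaler for the target; y_test is never scaled
scalery = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = scalery.fit_transform(y_train)

# Round trip: inverse_transform should recover the original y_train
assert np.allclose(scalery.inverse_transform(y_train_scaled), y_train)
```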
Is there some mistake in my approach? I just want to make sure that this type of model simply performs inherently badly here, and that the poor results are not due to an error on my side.
Here is my code:
import math

import pandas as pd
import susi
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load_dataset(), currency, predictor and data_range are defined elsewhere
data = load_dataset(currency, 1440, predictor, data_range)
X = data.drop(predictor, axis=1)
y = data[[predictor]]
scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=False,
)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scalery.fit_transform(y_train)
map_size = int(5 * math.sqrt(X_test.shape[0]))  # vesanto
info_dict = {
    'currency': currency,
    'data_range': data_range,
    'epochs': 0
}
for i in range(100, 2100, 100):
    info_dict['epochs'] = i
    print(f"GridSearch Configuration: {map_size}x{map_size}")
    print(currency, data_range, i)
    som = susi.SOMRegressor(
        n_rows=map_size,
        n_columns=map_size,
        n_iter_unsupervised=i,
        n_iter_supervised=i,
        neighborhood_mode_unsupervised="linear",
        neighborhood_mode_supervised="linear",
        learn_mode_unsupervised="min",
        learn_mode_supervised="min",
        learning_rate_start=0.5,
        learning_rate_end=0.05,
        # do_class_weighting=True,
        random_state=None,
        n_jobs=1)
    som.fit(X_train, y_train.ravel())
    y_pred = som.predict(X_test)
    y_pred_train = som.predict(X_train)
    # Invert the target scaling on the predictions only; keep y_train itself
    # scaled, so that later loop iterations still fit on the scaled target
    y_pred = scalery.inverse_transform(pd.DataFrame(y_pred))
    y_pred_train = scalery.inverse_transform(pd.DataFrame(y_pred_train))
    y_train_orig = scalery.inverse_transform(pd.DataFrame(y_train))
    print("Train: {0:.2f} %".format(r2_score(y_train_orig, y_pred_train) * 100))
    print("Test: {0:.2f} %".format(r2_score(y_test, y_pred) * 100))
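One sanity check I am considering (not part of the run above) is to push a trivial baseline through the exact same scaling steps: if even a mean predictor produced a hugely negative test R², the bug would be in the preprocessing rather than in the SOM. A sketch with synthetic data, using sklearn's DummyRegressor in place of the SOM:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]))[:, None] + rng.normal(scale=0.1, size=(200, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

scaler, scalery = MinMaxScaler(), MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
y_train_s = scalery.fit_transform(y_train)

# Baseline that always predicts the mean of the (scaled) training target
model = DummyRegressor(strategy="mean")
model.fit(X_train_s, y_train_s.ravel())

# Invert the target scaling on the predictions before scoring against the raw y_test
y_pred = scalery.inverse_transform(model.predict(X_test_s).reshape(-1, 1))
print("Baseline test R2: {0:.2f} %".format(r2_score(y_test, y_pred) * 100))
```

If this baseline scores near 0 % while the SOM scores in the negative millions, the preprocessing is fine and the problem lies with the model or its configuration.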