Suppose I have this dataframe (in a regression problem) with numerical and categorical data:
df_example
Var1_numerical Var2_categorical Var3_numerical Var4_categorical Var_to_predict
20 red 1 BK 352352
10 blue 4 BL 345341
5 orange 6 BA 423423
1 red 3 BK 342342
90 orange 2 BK 456456
So, in one part of the process I will use RobustScaler() on the numerical variables and OneHotEncoder() on the categorical variables so that the model can learn from them. At that point I will have a trained model that predicts with a certain error.
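For concreteness, here is a minimal sketch of that preprocessing step, assuming scikit-learn's ColumnTransformer and the column names from df_example above (the variable names are illustrative, not from the original question):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Scale numerical columns, one-hot encode categorical columns.
preprocessor = ColumnTransformer([
    ("num", RobustScaler(), ["Var1_numerical", "Var3_numerical"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["Var2_categorical", "Var4_categorical"]),
])

X = df_example.drop(columns="Var_to_predict")
X_train_transformed = preprocessor.fit_transform(X)  # fit ONCE, on training data
# model.fit(X_train_transformed, y) would follow here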
The interesting part is predicting on new data using model.predict():
import numpy as np

pred_list_example = [15, "red", 1, "BK"]
a = np.array(pred_list_example)
a = np.expand_dims(a, 0)
model.predict(a)
Question 1: Do I need to use RobustScaler() and OneHotEncoder() on pred_list_example before using model.predict(a)?
Question 2: In case the answer to the previous question is "yes", the Var_to_predict will be scaled due to RobustScaler(). Do I need to use RobustScaler().inverse_transform to get the original numeric value of the prediction?
Yes, and more than that: you must use the same RobustScaler() and OneHotEncoder() instances to do the transformation, or they won't know how much to scale by or what order your one-hot categories go in.

Yes, though note a subtlety: RobustScaler() requires a certain number of columns, and scales each one by a different amount. This means that there's no easy way to give it just your Y variable and ask it to undo the transform on this one variable.

For this reason, I suggest having two RobustScaler() instances: one for your X variables and one for your Y variable, so that you can undo scaling on a predicted Y value without having the X variables to go with it.

There is also the question of whether it is even needed to scale Y variables. Some people would say that it's not necessary. You can read a pro and con argument here.