Label Encoding of Categorical values for Future df


I am building a model for which label encoding of 2 categorical columns turned out to be the better approach, so I applied it to train_df and finalized the model.

To predict on test_df, I fit the encoder for the 2 categorical columns on train_df and then transform the values in test_df, with something like the code below:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder handles one 1-D column at a time; "col" is a placeholder column name
le = LabelEncoder()
le.fit(train_df["col"])

le.transform(test_df["col"])

Now I have to save the model and give the .pkl file to another team. If they want to use the model, do they have to fit the label encoding on train_df again and then transform their new data?


There is 1 answer below

Ben Reiniger On

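You should save your fitted LabelEncoder (e.g. as another pickle) and provide it along with the model, plus instructions (a Python script or snippet?) for how to use both to reach a final prediction: le.transform, then model.predict. The fitted encoder already stores the mapping it learned from train_df, so the other team does not need to refit it.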
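A minimal sketch of that hand-off, assuming a single categorical column and a LogisticRegression as a stand-in for the real model. The toy data, the file name model_bundle.pkl, and the dict keys are all made up for illustration; they are not from the question.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for one categorical column of train_df and its target.
train_colors = ["red", "green", "blue", "green", "red", "blue"]
train_target = [0, 1, 1, 1, 0, 1]

# Training side: fit the encoder, then fit the model on the encoded values.
le = LabelEncoder()
X_train = le.fit_transform(train_colors).reshape(-1, 1)
model = LogisticRegression().fit(X_train, train_target)

# Ship both fitted objects together; the path and dict keys are illustrative.
bundle_path = os.path.join(tempfile.gettempdir(), "model_bundle.pkl")
with open(bundle_path, "wb") as f:
    pickle.dump({"encoder": le, "model": model}, f)

# Consumer side: load, transform with the already fitted encoder, then predict.
with open(bundle_path, "rb") as f:
    bundle = pickle.load(f)

new_colors = ["blue", "red"]
X_new = bundle["encoder"].transform(new_colors).reshape(-1, 1)
predictions = bundle["model"].predict(X_new)
```

Note that the consumer side never calls fit. Pickles are generally only safe to load with a compatible scikit-learn version, so it is worth telling the other team which version you trained with.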

You might consider using a Pipeline (and potentially other composite estimators) from sklearn to package all of that into one object, so the other team only needs to call predict on it.
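As a rough sketch of the Pipeline route: the column names, toy data, and LogisticRegression below are placeholders, not from the question. It uses a ColumnTransformer with OrdinalEncoder because that encoder accepts several feature columns at once and fits inside a Pipeline, which LabelEncoder does not.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training frame: two categorical columns plus one numeric column.
train_df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "M", "S", "L"],
    "weight": [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
})
y_train = [0, 1, 1, 1, 0, 1]

# Encode only the categorical columns; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["color", "size"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
pipe.fit(train_df, y_train)

# The whole pipeline pickles as one object, encoding step included.
restored = pickle.loads(pickle.dumps(pipe))

# The consumer passes raw, unencoded data straight to predict.
new_df = pd.DataFrame({"color": ["blue"], "size": ["L"], "weight": [3.2]})
predictions = restored.predict(new_df)
```

With this layout there is only one file to hand over and no separate transform step for the other team to remember.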

N.B. LabelEncoder is intended for target variables (and even then mostly just internally); you would probably be slightly better off with OrdinalEncoder. See e.g. this DS.SE question.
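To illustrate the difference, here is a small sketch with made-up data. OrdinalEncoder takes both categorical columns in one call, and with handle_unknown="use_encoded_value" (available in reasonably recent scikit-learn releases) it maps categories unseen during fit to a value you choose, whereas LabelEncoder.transform raises a ValueError on unseen labels. The sentinel -1 is an arbitrary choice here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frames; "purple" does not appear in the training data.
train_df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})
future_df = pd.DataFrame({"color": ["green", "purple"], "size": ["L", "S"]})

# Fit once on both columns; unseen categories are encoded as -1 instead of raising.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_df)

# Categories are sorted per column, so blue=0, green=1, red=2 and L=0, M=1, S=2.
encoded = enc.transform(future_df)
```

Whether -1 is a sensible input for the downstream model depends on the model, so treat this as a starting point rather than a recommendation.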