Predicting from a scikit-learn RandomForestClassifier with Categorical Data

847 Views Asked by JV88V At 16 August 2025 at 21:30

I created a RandomForestClassifier model with scikit-learn, using 10 different text features and a training set of 10,000. I then pickled the model (76 MB) in hopes of using it for prediction.

However, in order to produce the Random Forest, I used the LabelEncoder and OneHotEncoder for best results on the categorical/string data.

Now I'd like to load the pickled model and get a classification prediction for a single instance. However, I'm not sure how to encode the text of that one instance without loading the entire training and test CSV again and repeating the whole encoding process.

Loading the CSV files every time seems quite laborious. I'd like this to run 1,000 times per hour, so that doesn't seem right to me.

Is there a way to quickly encode 1 row of data given the pickle or other variable/setting? Does encoding always require ALL the data?

If loading all the training data is required to encode a single row, would it be advantageous to encode the text data myself in a database, where each feature is assigned to a table with an auto-incremented numeric id and a UNIQUE key on the text/categorical field, and then pass this id to the RandomForestClassifier? Obviously I would need to refit and pickle this new model, but then I would know the exact (encoded) numeric representation of a new row and could simply request a prediction on those values.

It's highly likely that I'm missing a feature or misunderstanding scikit-learn or Python; I only started with both three days ago. Please excuse my naivety.

user7347576 On 06 January 2017 at 00:05 BEST ANSWER

You should also save your fitted LabelEncoder and OneHotEncoder with pickle. You can then load them each time and easily transform new instances. For example,

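import pickle  # cPickle is Python 2 only; Python 3 uses pickle

import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
from sklearn import preprocessing

# Fit the encoder once, on the training data.
# All categories are strings here, so the sort order of the classes is unambiguous.
le = preprocessing.LabelEncoder()
train_x = ['0', '1', '2', '6', 'true', 'false']
le.fit(train_x)

# Save the fitted encoder
joblib.dump(le, '/path/to/save/model')
# OR
with open('/path/to/model', 'wb') as f:
    pickle.dump(le, f)

# Load the fitted encoder
le = joblib.load('/path/to/save/model')
# OR
with open('/path/to/model', 'rb') as f:
    le = pickle.load(f)

# Then use as normal; no training data is needed at this point
new_x = ['0', '0', '0', '2', '2', '2', 'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])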
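
Since the question also mentions OneHotEncoder and predicting on a single instance, one possible alternative is to bundle the encoder and the forest into a single scikit-learn Pipeline and pickle that one object. The following is only a minimal sketch under stated assumptions: the toy features, the labels, and the names train_X, train_y, and new_row are hypothetical, and the pickle is kept in memory rather than written to a file.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two text features per row, one label per row.
train_X = [
    ["red", "small"],
    ["red", "large"],
    ["blue", "small"],
    ["blue", "large"],
]
train_y = ["a", "a", "b", "b"]

# The encoder and the forest are fitted together and travel as one object.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(train_X, train_y)

# Serialize once after training (in memory here; a file works the same way).
blob = pickle.dumps(pipeline)

# At prediction time: load once, then encode and classify single rows.
loaded = pickle.loads(blob)
new_row = [["red", "small"]]  # one instance, still passed as a 2D structure
prediction = loaded.predict(new_row)
```

With this layout the raw text of a new row goes straight into predict, so neither the training CSV nor a separate encoder file has to be loaded. Note that a plain LabelEncoder raises a ValueError when transform sees a label it was not fitted on, so some policy for unseen categories is needed either way.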
