I created a RandomForestClassifier model with scikit-learn, using 10 different text features and a training set of 10,000. Then I pickled the model (76 MB) in hopes of using it for prediction.
However, in order to produce the Random Forest, I used the LabelEncoder and OneHotEncoder for best results on the categorical/string data.
Now, I'd like to pull up the pickled model and get a classification prediction on 1 instance. However, I'm not sure how to encode the text on the 1 instance without loading the entire training & test dataset CSV again and going through the entire encoding process.
It seems quite laborious to load the csv files every time. I'd like this to run 1000x per hour so it doesn't seem right to me.
Is there a way to quickly encode 1 row of data given the pickle or other variable/setting? Does encoding always require ALL the data?
If loading all the training data is required to encode a single row, would it be advantageous to encode the text data myself in a database, where each feature is assigned to a table with an auto-incremented numeric id and a UNIQUE key on the text/categorical field, and then pass this id to the RandomForestClassifier? Obviously I would need to refit and pickle this new model, but then I would know exactly the (encoded) numeric representation of a new row and could simply request a prediction on those values.
It's highly likely that I'm missing a feature or misunderstanding scikit-learn or Python; I only started with both 3 days ago. Please excuse my naivety.
You should pickle your fitted LabelEncoder and OneHotEncoder as well as the model. You can then load them each time and transform new instances without touching the training data. For example,
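A minimal sketch of that workflow, under some assumptions: the toy data, the `predict_one` helper, and the dictionary keys are all hypothetical, and the in-memory `pickle.dumps`/`pickle.loads` pair stands in for writing to and reading from a file. It fits one LabelEncoder per column and a OneHotEncoder on the resulting integer codes, which may differ from how your encoders were set up.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: two categorical text features and a binary label.
X_train = np.array([
    ["red", "small"],
    ["blue", "large"],
    ["green", "small"],
    ["red", "large"],
])
y_train = np.array([0, 1, 0, 1])

# --- Training time: fit the encoders once, then the model ---
# One LabelEncoder per column maps each string to an integer code.
label_encoders = [
    LabelEncoder().fit(X_train[:, i]) for i in range(X_train.shape[1])
]
X_int = np.column_stack(
    [le.transform(X_train[:, i]) for i, le in enumerate(label_encoders)]
)

# The OneHotEncoder expands the integer codes into indicator columns.
one_hot = OneHotEncoder(handle_unknown="ignore").fit(X_int)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(one_hot.transform(X_int), y_train)

# Pickle the fitted encoders together with the model.
# (In practice you would pickle.dump this to a file instead.)
blob = pickle.dumps(
    {"label_encoders": label_encoders, "one_hot": one_hot, "model": model}
)

# --- Prediction time: no training data needed ---
saved = pickle.loads(blob)


def predict_one(row, saved):
    """Encode a single row of raw strings and return the predicted class."""
    codes = [
        le.transform([value])[0]
        for le, value in zip(saved["label_encoders"], row)
    ]
    features = saved["one_hot"].transform([codes])
    return saved["model"].predict(features)[0]


prediction = predict_one(["red", "small"], saved)
print(prediction)
```

The fitted encoders already store the category-to-number mapping they learned, so encoding one row only needs the unpickled objects. If you load them once when your process starts and keep them in memory, 1000 predictions per hour should be no problem. Note that `LabelEncoder.transform` raises a `ValueError` for a value it did not see during fitting, so you may want to handle that case explicitly. In recent scikit-learn versions, OneHotEncoder also accepts string columns directly, which would make the LabelEncoder step unnecessary.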