I am a data science beginner, and I have now built a Python data model.
In the data-cleaning step I had to drop some columns, add new columns, hash some columns into new columns, and convert some columns to numeric.
For example (not from any real project):
Before (original data columns): Name (str), City (str), State (str), Status (str), Gender (str), Salary (float)
After: City_hash (int), State_hash (int), City_State_hash (int) - a combined City + State, Status (int) - target variable, Gender (int), Salary (float)
The model's name is my_model. I want to test it now by passing a numpy array to the model.
The steps are below:

import numpy as np

features = np.array([[xx, xx, xx, xx, xx, ...]])  # where the xx are the feature values to pass

# use the inputs to predict the output
prediction = my_model.predict(features)
print("Prediction: {}".format(prediction))
I just wanted to clarify what to put in features. Should the values be in the order of my "After" data, i.e. City_hash, State_hash, City_State_hash, etc.?
If yes, what about the hashed columns? The state of California is no longer 'CA' but a hashed value. Do I have to pass the hashed value?
Thanks for any info ...
The model I actually created and want to test, if interested : https://www.kaggle.com/josephramon/sba-xgboost-model
OK, I think I answered my own question through more testing. I should input the data in the same column order as the training dataset, dropping the target variable.
For hashed columns, I needed to change my hashing code to be reproducible in production: if a user enters 'CA' for the state of California, my code should hash it exactly as it was hashed during the modeling data preparation.
The same goes for one-hot encoded columns and other encodings: I just have to be able to reproduce them at prediction time.
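For the toy columns above, that ordering would look like the sketch below. The numeric values are placeholders (not real hash outputs), and my_model stands in for the trained estimator:

```python
import numpy as np

# Column order must match the prepared training data,
# with the target (Status) dropped:
# [City_hash, State_hash, City_State_hash, Gender, Salary]
features = np.array([[184900, 52710, 731455, 1, 65000.0]])

# prediction = my_model.predict(features)  # my_model: the trained model
```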
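One common way to make the hash reproducible is to use hashlib instead of Python's built-in hash(), which is salted per process (so the same string can hash differently between runs). A minimal sketch, assuming a fixed bucket count; stable_hash and the example values are my own illustration, not the notebook's code:

```python
import hashlib

def stable_hash(value: str, n_buckets: int = 2**20) -> int:
    """Deterministic string hash: identical across runs and machines,
    unlike the built-in hash(), which is randomized per process."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# The same input always maps to the same bucket:
city_hash = stable_hash("Los Angeles")
state_hash = stable_hash("CA")
city_state_hash = stable_hash("Los Angeles" + "_" + "CA")
```

The same stable_hash must then be called at prediction time on whatever the user enters (e.g. 'CA'), so the test row matches the training encoding.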
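For one-hot encoding, the key is to pin the category list from training time so the dummy columns always come out in the same order, even for a single-row input. A hedged sketch using pandas (encode_gender and the category list are illustrative, not from the notebook):

```python
import pandas as pd

CATEGORIES = ["F", "M"]  # fixed when the model was trained

def encode_gender(value: str) -> list:
    # Pinning the categories guarantees a column for every category,
    # in a stable order, regardless of which value appears in the input.
    s = pd.Categorical([value], categories=CATEGORIES)
    return pd.get_dummies(s).values[0].tolist()
```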
I also modified the model creation to split the data three ways, train:validation:test, instead of just train:validation. I can now use the test set, previously unseen by the model, for testing and evaluating the metrics.
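A three-way split can be done with two calls to scikit-learn's train_test_split; the ratios and random_state below are just an example (60:20:20), not the notebook's actual settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # dummy feature matrix
y = np.arange(50)                  # dummy target

# First carve off the held-out test set (20% of the data)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remainder into train and validation
# (25% of the remaining 80% = 20% overall, giving 60:20:20).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
```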