I have a NYC 311 complaint dataset. I want to build a random forest classifier that takes categorical input features about a complaint and predicts the complaint type. The input features of a given complaint record are:
X = df[['Location Type', 'Incident Zip', 'Street Name',
'City', 'Borough', 'Open Data Channel Type']]
All of these features are nominal (categorical) variables, so I will need to convert the string values into numeric ones before feeding them to the model. I am reluctant to use one-hot encoding, since some features have more than 1,000 categories and the resulting computation might be out of reach of my laptop.
I was thinking of using the frequency of each category (count of the particular category / total count) instead of the nominal string values. Would that be a good strategy?
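To be concrete, here is a sketch of the encoding I have in mind (the toy data is made up; only the column name comes from the real dataset):

```python
import pandas as pd

# Toy frame standing in for the 311 data
df = pd.DataFrame({"Borough": ["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX"]})

# Frequency encoding: replace each category with its share of the column
freq = df["Borough"].value_counts(normalize=True)
df["Borough_freq"] = df["Borough"].map(freq)
# BROOKLYN -> 0.5, QUEENS -> 0.25, BRONX -> 0.25
```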
A random forest is an ensemble of decision trees in which you repeatedly divide your data into subsets based on splits of your variables. Coding each categorical variable by its frequency is not very sound: it assumes that categories with similar frequencies will behave similarly in predicting the response, and there is nothing in your data to suggest that.
In the case where you have 1,000+ categories, it might make more sense to group the rare categories or singletons into one big "Other" category before doing the one-hot encoding.
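A minimal sketch of that grouping step; the count threshold of 5 and the toy data are illustrative assumptions, not values from the question:

```python
import pandas as pd

# Toy column: two common categories plus three singletons
s = pd.Series(["A"] * 50 + ["B"] * 30 + ["C", "D", "E"])

counts = s.value_counts()
rare = counts[counts < 5].index               # tune this threshold to your data
s_grouped = s.where(~s.isin(rare), "Other")   # collapse rare levels into "Other"

dummies = pd.get_dummies(s_grouped)           # one-hot on the reduced category set
# columns: A, B, Other  (instead of A, B, C, D, E)
```

If you use scikit-learn, recent versions of `OneHotEncoder` can do this collapsing for you via its `min_frequency` parameter.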