I have a NYC 311 complaint dataset. I want to build a random forest classifier that takes categorical input features about a complaint and predicts the complaint type. The input features of a given complaint record are:
X = df[['Location Type', 'Incident Zip', 'Street Name',
'City', 'Borough', 'Open Data Channel Type']]
All of these features are nominal (categorical) variables, so I will need to convert the string values into numeric ones before feeding them to the model. I am reluctant to use one-hot encoding, since some features have more than 1,000 categories and the resulting computation might be out of reach of my laptop.
I was thinking of using the frequency of each category (count of the particular category / total count) instead of the nominal string values. Would that be a good strategy?
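To be concrete, here is a sketch of the encoding I have in mind (the toy data is made up; only the column name comes from the real dataset):

```python
import pandas as pd

# Toy frame standing in for the 311 data
df = pd.DataFrame({"Borough": ["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX"]})

# Frequency encoding: replace each category with its share of the column
freq = df["Borough"].value_counts(normalize=True)
df["Borough_freq"] = df["Borough"].map(freq)
# BROOKLYN -> 0.5, QUEENS -> 0.25, BRONX -> 0.25
```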
A random forest is an ensemble of decision trees in which you repeatedly divide your data into subsets based on splits of your variables. Coding each categorical variable by its frequency is not very sound: it assumes that categories with similar frequencies will behave similarly in predicting the response, and there is nothing in your data to suggest that.
In the case where you have 1,000+ categories, it might make more sense to group the rare categories or singletons into one big "Other" category before doing the one-hot encoding.
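A minimal sketch of that grouping step; the count threshold of 5 and the toy data are illustrative assumptions, not values from the question:

```python
import pandas as pd

# Toy column: two common categories plus three singletons
s = pd.Series(["A"] * 50 + ["B"] * 30 + ["C", "D", "E"])

counts = s.value_counts()
rare = counts[counts < 5].index               # tune this threshold to your data
s_grouped = s.where(~s.isin(rare), "Other")   # collapse rare levels into "Other"

dummies = pd.get_dummies(s_grouped)           # one-hot on the reduced category set
# columns: A, B, Other  (instead of A, B, C, D, E)
```

If you use scikit-learn, recent versions of `OneHotEncoder` can do this collapsing for you via its `min_frequency` parameter.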