Scikit Learn: Predicting Categorical Features

I am trying to figure out the best general way to predict categorical features in scikit-learn and would like some advice. In particular, I can use a decision tree, which handles the categorical data just fine, but I would like to try out some other multi-class classification models. I can use the one-hot method to turn the categorical features into many binary features.

Example training set:

Age| Color  | City     | Freq
35 |'Orange'|'Seattle' | '<30'
55 |'Black' |'Portland'| '>30'
75 |'Red'   |'Seattle' | 'Never'

Can easily be changed to:

Age| Color |City | Freq
35 | 1 0 0 | 1 0 | 1 0 0
55 | 0 1 0 | 0 1 | 0 1 0
75 | 0 0 1 | 1 0 | 0 0 1

And I can split this into data target pairs:

X= Age| Color |City
   35 | 1 0 0 | 1 0
   55 | 0 1 0 | 0 1
   75 | 0 0 1 | 1 0

y= Freq
   1 0 0
   0 1 0
   0 0 1
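
For reference, a minimal sketch of that conversion and split using pandas (the column names mirror the toy table above; get_dummies is just one of several ways to one-hot encode):

    import pandas as pd

    # Toy data from the table above
    df = pd.DataFrame({
        'Age':   [35, 55, 75],
        'Color': ['Orange', 'Black', 'Red'],
        'City':  ['Seattle', 'Portland', 'Seattle'],
        'Freq':  ['<30', '>30', 'Never'],
    })

    # One-hot encode the categorical input columns; Age stays numeric
    X = pd.get_dummies(df[['Age', 'Color', 'City']], columns=['Color', 'City'])

    # One-hot encode the target the same way (the step in question)
    y = pd.get_dummies(df['Freq'])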

Then I am able to process this with various scikit-learn classification models, but it's not clear to me that the three 'Freq' columns are understood to be mutually exclusive. Hence my questions:

Is it possible to predict categorical features with generalized classification routines besides just decision trees?

How does one ensure that a set of binary features remains mutually exclusive?

Further, is it possible to present the results in a metric that joins the three binary features intelligently?

Thanks for your help!

1 Answer

Yes, it's possible. Just don't one-hot encode your output vector; convert it to a number instead.

For example, for Freq:

'<30' = 0
'>30' = 1
'Never' = 2

If you do this, any regression algorithm should work. You can then set thresholds to separate each of your output classes.
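
A minimal sketch of this idea, using scikit-learn's LabelEncoder for the mapping and an arbitrary regressor (the choice of LinearRegression here is purely illustrative):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.linear_model import LinearRegression

    # Features: Age plus the one-hot Color and City columns from the question
    X = np.array([
        [35, 1, 0, 0, 1, 0],
        [55, 0, 1, 0, 0, 1],
        [75, 0, 0, 1, 1, 0],
    ])

    # Map the string labels to integers instead of one-hot encoding them
    le = LabelEncoder()
    y = le.fit_transform(['<30', '>30', 'Never'])  # -> array([0, 1, 2])

    # Fit any regressor, then round and clip the continuous output
    # back to the nearest class index
    reg = LinearRegression().fit(X, y)
    pred = np.clip(np.rint(reg.predict(X)), 0, 2).astype(int)
    print(le.inverse_transform(pred))

Note that most scikit-learn classifiers accept integer (or even string) labels directly, so the explicit thresholding is only needed if you specifically want to use a regression model.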

Another option would be to have three binary classification models, one trained per class (one-vs-rest), as sketched below.
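
scikit-learn packages that strategy as OneVsRestClassifier; a sketch, reusing X and y from above (the base estimator is a placeholder choice):

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression

    # One binary classifier per class; at predict time the most confident
    # class wins, so the predicted labels stay mutually exclusive
    ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
    print(ovr.predict(X))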

Additionally, take a look at softmax regression.
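
In scikit-learn terms, softmax regression is multinomial logistic regression; predict_proba then gives one probability per class that sums to one, which also answers the mutual-exclusivity concern. Reusing X and y from above:

    from sklearn.linear_model import LogisticRegression

    # With the default lbfgs solver, multiclass LogisticRegression fits a
    # multinomial (softmax) model: each row of predict_proba sums to 1,
    # so the three classes are treated as mutually exclusive
    softmax = LogisticRegression().fit(X, y)
    print(softmax.predict_proba(X))
    print(softmax.predict(X))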