I am trying to figure out the best general way to predict categorical features in scikit-learn and would like some advice. In particular, I can just use a decision tree and it will handle the categorical data just fine, but I would like to try out some other multiclass classification models. I can use one-hot encoding to turn the categorical features into many binary features.
Example training set:
Age | Color    | City       | Freq
35  | 'Orange' | 'Seattle'  | '<30'
55  | 'Black'  | 'Portland' | '>30'
75  | 'Red'    | 'Seattle'  | 'Never'
Can easily be changed to:
Age | Color | City | Freq
35  | 1 0 0 | 1 0  | 1 0 0
55  | 0 1 0 | 0 1  | 0 1 0
75  | 0 0 1 | 1 0  | 0 0 1
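For reference, here is a minimal sketch of how that encoding can be produced, assuming pandas and its get_dummies helper (the frame and column names just mirror the toy example above):

    import pandas as pd

    # Toy frame mirroring the example training set above.
    df = pd.DataFrame({
        'Age':   [35, 55, 75],
        'Color': ['Orange', 'Black', 'Red'],
        'City':  ['Seattle', 'Portland', 'Seattle'],
        'Freq':  ['<30', '>30', 'Never'],
    })

    # One-hot encode the categorical columns; 'Age' stays numeric.
    encoded = pd.get_dummies(df, columns=['Color', 'City', 'Freq'])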
And I can split this into data/target pairs:
X = Age | Color | City
    35  | 1 0 0 | 1 0
    55  | 0 1 0 | 0 1
    75  | 0 0 1 | 1 0
y = Freq
    1 0 0
    0 1 0
    0 0 1
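In code, that split might look like this (a sketch continuing from the encoded frame above; the 'Freq_' prefix is what get_dummies produces):

    # Target: the one-hot columns derived from 'Freq'; features: everything else.
    y = encoded[[c for c in encoded.columns if c.startswith('Freq_')]]
    X = encoded.drop(columns=y.columns)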
Then I am able to process this with various scikit-learn classification models, but it's not clear to me that the three 'Freq' columns are understood to be mutually exclusive. Hence my questions:
Is it possible to predict categorical features with generalized classification routines besides just decision trees?
How does one ensure that a set of binary features remains mutually exclusive?
Further, is it possible to present the results in a metric that joins the three binary features intelligently?
Thanks for your help!
Yes, it's possible. Just don't one-hot encode your output vector; convert it to a single number instead.
As in Freq: '<30' -> 0, '>30' -> 1, 'Never' -> 2.
If you do this, any regression algorithm should work. You can then set thresholds for each of your output classes.
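A minimal sketch of that idea, assuming scikit-learn's LabelEncoder plus a RandomForestRegressor (any regressor would do; rounding to the nearest class integer is one simple thresholding rule):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestRegressor

    # Map the categories to integers; LabelEncoder sorts them, which here
    # happens to give '<30' -> 0, '>30' -> 1, 'Never' -> 2.
    le = LabelEncoder()
    y_num = le.fit_transform(['<30', '>30', 'Never'])

    # Toy one-hot features from the example above.
    X = [[35, 1, 0, 0, 1, 0],
         [55, 0, 1, 0, 0, 1],
         [75, 0, 0, 1, 1, 0]]

    reg = RandomForestRegressor(random_state=0).fit(X, y_num)

    # Threshold the continuous predictions back onto the class integers.
    pred = np.clip(np.round(reg.predict(X)), 0, len(le.classes_) - 1).astype(int)
    print(le.inverse_transform(pred))

Note that once the target is a single integer column, mutual exclusivity is automatic: each row has exactly one class. In fact, scikit-learn classifiers accept such an integer (or even string) label vector directly and treat the classes as mutually exclusive, so the thresholding step is only needed if you insist on a regressor.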
Another option would be to train 3 binary classification models, one per class (one-vs-rest).
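scikit-learn can manage those per-class models for you; a sketch assuming its OneVsRestClassifier wrapper around LogisticRegression (reusing the toy X and y_num from above):

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression

    # Fits one binary LogisticRegression per class and predicts the class
    # with the highest score, so predictions stay mutually exclusive.
    ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y_num)
    print(ovr.predict(X))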
Additionally, look at softmax regression (multinomial logistic regression), which models all the classes jointly so the predicted class probabilities sum to 1.
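A sketch of that, assuming scikit-learn's LogisticRegression in its multinomial mode, again on the toy data (recent versions use multinomial by default for multiclass problems):

    from sklearn.linear_model import LogisticRegression

    # One joint softmax model whose predicted class probabilities
    # sum to 1 across the three 'Freq' classes.
    softmax = LogisticRegression(multi_class='multinomial', solver='lbfgs')
    softmax.fit(X, y_num)
    print(softmax.predict_proba(X))  # each row sums to 1
    print(softmax.predict(X))

The predict_proba output is also one natural way to report all three classes jointly, which speaks to your third question about a combined metric.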