I've seen a few questions on class imbalance in a multiclass setting. However, I have a multi-label problem, so how would you deal with it in this case?
I have a set of around 300k text examples. As mentioned in the title, each example has at least one label, and there are only 100 possible unique labels. I've reduced this problem down to binary classification for Vowpal Wabbit by taking advantage of namespaces, e.g.
From:
healthy fruit | bananas oranges jack fruit
evil monkey | bipedal organism family guy
...
To:
1 |healthy bananas oranges jack fruit
1 |fruit bananas oranges jack fruit
0 |evil bananas oranges jack fruit
0 |monkey bananas oranges jack fruit
0 |healthy bipedal organism family guy
0 |fruit bipedal organism family guy
1 |evil bipedal organism family guy
1 |monkey bipedal organism family guy
...
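
For concreteness, here's a rough Python sketch of the reduction I'm doing (the label list, I/O, and names are simplified placeholders, not my actual code):

# Sketch of the multi-label -> binary reduction shown above.
# ALL_LABELS would hold all 100 unique labels; truncated here.
ALL_LABELS = ["healthy", "fruit", "evil", "monkey"]

def to_binary_vw(true_labels, feature_text):
    # Emit one binary VW example per candidate label, using the label
    # name as the namespace so each label hashes to its own features.
    for label in ALL_LABELS:
        target = 1 if label in true_labels else 0
        yield "%d |%s %s" % (target, label, feature_text)

for line in to_binary_vw({"healthy", "fruit"}, "bananas oranges jack fruit"):
    print(line)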
I'm using VW's default options (which I believe amount to online SGD with the squared loss function). I chose squared loss because it closely resembles the Hamming loss.
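Concretely, my training and testing commands look something like this (file names are placeholders):

vw train.vw -f model.vw
vw -t -i model.vw train.vw -p predictions.txt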
After training, when testing on the same training set, I noticed that every example was predicted with the '0' label... which is one way of minimizing the loss, I guess. At this point, I'm not sure what to do. I was thinking of using cost-sensitive one-against-all classification to try to balance the classes, but reducing multi-label to multi-class is infeasible since there exist 2^100 possible label combinations. I'm wondering if anyone has any other suggestions.
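For reference, vw's cost-sensitive one-against-all mode (--csoaa) takes per-class costs directly, trained with something like vw --csoaa 4 train_cs.vw -f model_cs.vw. A 4-class toy example of the input format would look like this (costs made up); note that it still predicts a single class per example, which is why it doesn't map cleanly onto the multi-label setting:

1:0.0 2:0.0 3:1.0 4:1.0 | bananas oranges jack fruit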
Edit: I finally had the chance to test out class-imbalance handling, specifically for vw. vw handles imbalance very badly, at least for high-dimensional, sparsely populated text features. I tried ratios from 1:1 to 1:25, with performance degrading abruptly at the 1:2 ratio.
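One way to impose such ratios in vw is via per-example importance weights, the optional number right after the label; e.g., up-weighting positives 25x (the weights here are purely illustrative):

1 25.0 |healthy bananas oranges jack fruit
0 1.0 |evil bananas oranges jack fruit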
In general, if you want to account for class imbalance in your training data, you need to switch to a metric better suited to the problem. For class imbalance specifically, a good choice is the area under the ROC curve (AUC), which is designed to account for exactly this issue.
There's a multi-label version of AUC as well, but since you've already reduced the problem to binary classification, the standard binary version should work out of the box.
Here's the Wikipedia article explaining the concept more fully.
And here's the relevant sklearn documentation, which might be less helpful since I'm not sure what language you're working in.
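To make that concrete, here's a quick sklearn sketch (the numbers are made up); roc_auc_score also accepts a multi-label indicator matrix if you keep the problem in its original form:

from sklearn.metrics import roc_auc_score

# Binary case: true 0/1 labels vs. predicted scores.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_scores))  # 0.75

# Multi-label case: indicator matrix of shape (n_samples, n_labels).
Y_true = [[1, 0], [0, 1], [1, 1]]
Y_scores = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]]
print(roc_auc_score(Y_true, Y_scores, average="macro"))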