Oversampling a sparse dataset in Python


I have a dataset with multi-labeled data. There are 20 labels in total (0 to 19), with an imbalanced distribution among them. Here is an overview of the data:

|id   |label|value       |
|-----|-----|------------|
|95534|0    |65.250002088|
|95535|18   |            |
|95536|0    |            |
|95536|0    |100         |
|95536|0    |            |
|95536|0    |53.68547236 |
|95536|0    |            |
|95537|1    |            |
|95538|0    |            |
|95538|0    |            |
|95538|0    |            |
|95538|0    |656.06155202|
|95538|0    |            |
|95539|2    |            |
|5935 |0    |            |
|5935 |0    |150         |
|5935 |0    |50          |
|5935 |0    |24.610985335|
|5935 |0    |            |
|5935 |0    |223.81789584|
|5935 |0    |148.1805218 |
|5935 |0    |110.9712538 |
|34147|19   |73.62651909 |
|34147|19   |            |
|34147|19   |53.35958016 |
|34147|19   |            |
|34147|19   |            |
|34147|19   |            |
|34147|19   |393.54029411|

I am looking to oversample the data to balance the labels. I came across methods like SMOTE and SMOTENC, but they all require splitting the data into train and test sets first, and they do not work with sparse data. Is there any way I can do this on the entire dataset in the pre-processing step, before splitting?

2 Answers

BEST ANSWER

To sample rows so that each label is sampled with equal probability:

  • the probability of drawing a row with a given label should be 1/n_labels
  • within a given label, the probability of drawing a particular row should be 1/n_rows, where n_rows is the number of rows carrying that label

The overall probability for each row is then p_row = 1/(n_labels*n_rows). You can generate these weights with groupby and pass them to df.sample as follows:

import numpy as np
import pandas as pd

df_dict = {'id': {0: 95535, 1: 95536, 2: 95536, 3: 95536, 4: 95536, 5: 95536, 6: 95537, 7: 95538, 8: 95538, 9: 95538, 10: 95538, 11: 95538, 12: 95539, 13: 5935, 14: 5935, 15: 5935, 16: 5935, 17: 5935, 18: 5935, 19: 5935, 20: 5935, 21: 34147, 22: 34147, 23: 34147, 24: 34147, 25: 34147, 26: 34147, 27: 34147}, 'label': {0: 18, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 2, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 19, 22: 19, 23: 19, 24: 19, 25: 19, 26: 19, 27: 19}, 'value': {0: '            ', 1: '            ', 2: '100         ', 3: '            ', 4: '53.68547236 ', 5: '            ', 6: '            ', 7: '            ', 8: '            ', 9: '            ', 10: '656.06155202', 11: '            ', 12: '            ', 13: '            ', 14: '150         ', 15: '50          ', 16: '24.610985335', 17: '            ', 18: '223.81789584', 19: '148.1805218 ', 20: '110.9712538 ', 21: '73.62651909 ', 22: '            ', 23: '53.35958016 ', 24: '            ', 25: '            ', 26: '            ', 27: '393.54029411'}}    

df = pd.DataFrame.from_dict(df_dict)

# number of distinct labels
n_labels = df.label.nunique()
# for each row, the number of rows sharing its label
n_rows = df.groupby("label").id.transform("count")
# weight so that each label, and each row within a label, is equally likely
weights = 1/(n_rows*n_labels)

# sanity check: the probabilities should sum to 1 (np.isclose avoids float-equality pitfalls)
assert np.isclose(weights.sum(), 1.0)

df_samples = df.sample(n=40000, weights=weights, replace=True, random_state=19)

Verify that the label draws are approximately uniform:

print(df_samples.label.value_counts()/len(df_samples))

# sampling frequency by group:
# 0     0.203325
# 2     0.201075
# 18    0.200925
# 19    0.198850
# 1     0.195825
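
A note on the design choice: if you want exactly equal counts per label rather than equal sampling probabilities, you can draw a fixed number of rows per group instead. A minimal alternative sketch, assuming pandas >= 1.1 (for GroupBy.sample) and 8000 draws per label (40000 total over the 5 labels in this toy frame):

# draw the same number of rows from every label, with replacement
df_balanced = df.groupby("label").sample(n=8000, replace=True, random_state=19)
print(df_balanced.label.value_counts())  # exactly 8000 rows per label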
SECOND ANSWER

Theoretically speaking, you don't actually need to upsample your test set at all.

In class-imbalance settings, artificially balancing the test/validation set does not make sense: these sets must remain realistic. You want to test your classifier's performance in the real-world setting where, say, the negative class includes 99% of the samples, in order to see how well your model predicts the 1% positive class of interest without producing too many false positives. Artificially inflating the minority class or shrinking the majority one leads to performance metrics that are unrealistic and bear no relation to the real-world problem you are trying to solve.

Re-balancing makes sense only in the training set; otherwise the classifier can simply and naively classify every instance as negative and still achieve a perceived accuracy of 99%.

Hence, you can rest assured that in the setting you describe, rebalancing should apply only to the training set/folds, as sketched below.
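
For illustration, here is a minimal sketch of that workflow, assuming the imbalanced-learn package is installed and using its RandomOverSampler (which, unlike SMOTE, merely duplicates existing rows rather than interpolating new ones): split first, then oversample only the training portion.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# hypothetical feature/target columns; adapt to your actual, preprocessed features
X = df[["value"]]
y = df["label"]

# split first, so the test set keeps the original, imbalanced distribution
# (stratify assumes every label has enough rows; drop it otherwise)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=19
)

# oversample minority labels in the training set only
ros = RandomOverSampler(random_state=19)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

# the test set is left untouched and remains representative of the real-world distribution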