Dealing with imbalanced datasets in classification

I have a large dataframe of accounting fraud data, and I would like to fix its class imbalance.

First, I split the dataframe into two parts: X (the variables) and y (the target: fraud or no fraud).

I tried this:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import RandomUnderSampler

X = df[['fyear', 'gvkey', 'sich', 'insbnk', 'understatement', 'option',
       'p_aaer', 'new_p_aaer', 'act', 'ap', 'at', 'ceq', 'che',
       'cogs', 'csho', 'dlc', 'dltis', 'dltt', 'dp', 'ib', 'invt', 'ivao',
       'ivst', 'lct', 'lt', 'ni', 'ppegt', 'pstk', 're', 'rect', 'sale',
       'sstk', 'txp', 'txt', 'xint', 'prcc_f', 'dch_wc', 'ch_rsst', 'dch_rec',
       'dch_inv', 'soft_assets', 'ch_cs', 'ch_cm', 'ch_roa', 'issue', 'bm',
       'dpi', 'reoa', 'EBIT', 'ch_fcf']]
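# Select the target as a Series: iterating a one-column DataFrame yields
# its column labels, so Counter would count 'target' instead of the classes.
y = df['target']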

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))

and this:

# define sampling strategy
sample = SMOTEENN(sampling_strategy=0.5)
# fit and apply the transform
X_over, y_over = sample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_over)) 

But in both cases, the result was the same:

ValueError: could not convert string to float: '2.461.242' 

Can someone please help me?
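A note on the error itself: the value '2.461.242' contains two dots, so it is probably stored as text with '.' as a thousands separator, and both samplers need purely numeric input. Below is a minimal sketch of a conversion step, assuming '.' marks thousands and ',' marks decimals. The helper name `parse_dotted_number` and the sample values (other than '2.461.242') are made up for illustration and are not part of the original dataset.

```python
def parse_dotted_number(value):
    """Convert text such as '2.461.242' to a float.

    Assumption: '.' is a thousands separator and ',' is the decimal mark.
    Values that are already int or float are returned as floats unchanged.
    """
    if isinstance(value, (int, float)):
        return float(value)
    # Drop thousands separators first, then switch the decimal mark to '.'.
    text = str(value).strip().replace(".", "").replace(",", ".")
    return float(text)


# Illustrative values only; the first one is the string from the error message.
samples = ["2.461.242", "1.250,5", "87"]
converted = [parse_dotted_number(s) for s in samples]
print(converted)
```

If that assumption holds for the data, the helper could be applied to each text column before resampling, for example with `X[col] = X[col].map(parse_dotted_number)`. It should not be applied to numbers that use '.' as the decimal mark and are stored as text, because the dot would be stripped. When the data comes from a CSV file, the `thousands` and `decimal` parameters of `pandas.read_csv` may be a simpler way to get numeric columns at load time.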
