I have a pandas dataframe df with column 'TARGET' which takes values of 0 or 1 and column 'MONTH' which collects different months:
| MONTH | #_OBS_TARGET=0 | #_OBS_TARGET=1 |
|---|---|---|
| 202207 | 44619 | 52960 |
| 202208 | 48093 | 55399 |
| 202209 | 50161 | 56528 |
I want to downsample my dataframe to have the same number of observations with TARGET = 0 and TARGET = 1 for every value of MONTH:
| MONTH | #_OBS_TARGET=0 | #_OBS_TARGET=1 |
|---|---|---|
| 202207 | 44619 | 44619 |
| 202208 | 48093 | 48093 |
| 202209 | 50161 | 50161 |
I tred the following
for m in df['MONTH'].unique():
number_of_ones = len(df[(df['MONTH']==m) & (df['TARGET']==1)])
number_of_zeros = len(df[(df['MONTH']==m) & (df['TARGET']==0)])
n_obs_to_drop = number_of_ones - number_of_zeros
df[df['MONTH']==m].drop(df[(df['MONTH']==m) & (df['TARGET']==1)].sample(n_obs_to_drop).index, inplace = True)
But clearly it is not deleting anything and I get the following warning (EDIT2):
/opt/conda/envs/librerias_cbi/lib/python3.9/site-packages/pandas/core/frame.py:4901: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(
How should I get it? Also a diffent approach is welcomed.
Notice that there are duplicate values of the index in diffent values of MONTH. There are also many more columns in df which should be kept in the downsampled dataframe.
EDIT1:
I'm adding a reproducible example
import pandas as pd
data = {
"MONTH": [202207, 202207, 202207, 202207, 202208, 202208, 202208, 202209, 202209, 202209, 202209],
"TARGET": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
"other_column1": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110], # Example additional columns
"other_column2": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
}
df = pd.DataFrame(data)
pd.crosstab(df['MONTH'],df['TARGET'])
TARGET 0 1
MONTH
202207 1 3
202208 1 2
202209 2 2
Not sure if this is the most elegant way but I'll go for
Get the number of sample for each month
Merge with original df
Take a sample using size defined before