Downsample a pandas dataframe keeping same proportion of target in every month

65 Views Asked by Ale At 26 March 2024 at 11:52

I have a pandas dataframe df with column 'TARGET' which takes values of 0 or 1 and column 'MONTH' which collects different months:

MONTH	#_OBS_TARGET=0	#_OBS_TARGET=1
202207	44619	52960
202208	48093	55399
202209	50161	56528

I want to downsample my dataframe to have the same number of observations with TARGET = 0 and TARGET = 1 for every value of MONTH:

MONTH	#_OBS_TARGET=0	#_OBS_TARGET=1
202207	44619	44619
202208	48093	48093
202209	50161	50161

I tred the following

for m in df['MONTH'].unique():
    number_of_ones = len(df[(df['MONTH']==m) & (df['TARGET']==1)])
    number_of_zeros = len(df[(df['MONTH']==m) & (df['TARGET']==0)])
    n_obs_to_drop = number_of_ones - number_of_zeros 
    df[df['MONTH']==m].drop(df[(df['MONTH']==m) & (df['TARGET']==1)].sample(n_obs_to_drop).index, inplace = True)

But clearly it is not deleting anything and I get the following warning (EDIT2):

/opt/conda/envs/librerias_cbi/lib/python3.9/site-packages/pandas/core/frame.py:4901: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(

How should I get it? Also a diffent approach is welcomed.

Notice that there are duplicate values of the index in diffent values of MONTH. There are also many more columns in df which should be kept in the downsampled dataframe.

EDIT1:

I'm adding a reproducible example

import pandas as pd
data = {
"MONTH": [202207, 202207, 202207, 202207, 202208, 202208, 202208, 202209, 202209, 202209, 202209],
"TARGET": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
"other_column1": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],  # Example additional columns
"other_column2": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
}
    
df = pd.DataFrame(data)

pd.crosstab(df['MONTH'],df['TARGET'])

TARGET  0   1
MONTH       
202207  1   3
202208  1   2
202209  2   2

Original Q&A

There are 2 best solutions below

rpanai On 26 March 2024 at 12:34

Not sure if this is the most elegant way but I'll go for

Get the number of sample for each month

grp = pd.crosstab(df['MONTH'],df['TARGET'])\
    .min(1)\
    .reset_index(name='size')

    MONTH  size
0  202207     1
1  202208     1
2  202209     2

Merge with original df

df1 = pd.merge(df, grp)

Take a sample using size defined before

df2 = df1.groupby(['MONTH', 'TARGET'])\
    .apply(lambda x: x.sample(n=x['size'].iloc[0]))\
    .reset_index(drop=True)

Ale On 26 March 2024 at 17:16

I guess the problem was applying the .drop() to each slice of df, or something related to duplicate values in the index (for different values of MONTH). I solved the problem this way:

df_temp = pd.DataFrame() # create an empty df that will be filled in each iteration of the loop

for mes in df['MONTH'].unique():
    df_slice = df[df["MONTH"]==mes].copy() # create a temporary df where the index is unique
    
    number_of_ones = len(df_slice[df_slice['TARGET']==1])
    number_of_zeros = len(df_slice[df_slice['TARGET']==0])
    n_obs_to_drop = number_of_ones - number_of_zeros
   
    df_slice.drop(df_slice[df_slice['TARGET']==1].sample(n_obs_to_drop).index, inplace = True) # drop rows only for the partition
    
    df_temp = pd.concat([df_temp,df_slice]) # concatenate the partitions

df = df_temp
del df_temp
del df_slice

Downsample a pandas dataframe keeping same proportion of target in every month

There are 2 best solutions below

Get the number of sample for each month

Merge with original df

Take a sample using size defined before

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in DATA-MANIPULATION

Related Questions in DOWNSAMPLING

Trending Questions

Popular # Hahtags

Popular Questions