How to dynamically discretize a pandas column based on the summation of another column?

I have a process I repeat often, but I do not know how to write a reusable function for it. I need to take a variable (age, for example) and group it into discrete bins that each contain at least 1000 units of another column (weight). This has to work for any variable, and the minimum weight per bin must be a parameter, so that I can bin in 500-unit or 2000-unit increments instead.

Here is a sample data set:

import pandas as pd
import numpy as np

age = np.arange(0,30,1)

weight = np.linspace(200, 2500,30)

bin_dict = {'age':age, 'weight':weight}

df = pd.DataFrame(bin_dict)

Now let's say I want to bin age into bins of no less than 1000 units of weight each. Done by hand, the result would look like this:

df['bin_aged'] = pd.cut(df['age'],
                        bins = [-np.inf, 3, 5, 7, 9, 11,12,13,14,15,16,17,
                                18,19,20,21,22,23,24,25,26,27,28,np.inf],
                        labels = [3, 5, 7, 9, 11,12,13,14,15,16,17,
                                18,19,20,21,22,23,24,25,26,27,28,29])
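
The first edge in that list can be checked with a running sum of weight in age order. This is only a minimal sketch, assuming the sample data above (rebuilt here so the snippet runs on its own); the name `first_edge` is introduced purely for illustration:

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this snippet is self-contained.
df = pd.DataFrame({'age': np.arange(0, 30, 1),
                   'weight': np.linspace(200, 2500, 30)})

# Running total of weight in age order.
running = df['weight'].cumsum()

# First age at which the running total reaches the 1000-unit minimum;
# that age becomes the right edge of the first bin.
first_edge = df.loc[running >= 1000, 'age'].iloc[0]
print(first_edge)
```

Ages 0 through 2 add up to less than 1000 units, so the first bin has to extend to age 3. Later edges follow the same rule, restarting the total after each closed bin.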

This is what it would look like if I grouped by the new binned column:

df.groupby('bin_aged').agg({'weight':'sum'})
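
As a sanity check on the hand-written bins, the sketch below rebuilds the sample data, applies the same edges, and tests whether every bin reaches the 1000-unit minimum. The variable names `edges` and `sums` are assumptions made for this example, and `observed=True` is passed only so that the grouping behaves the same across pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': np.arange(0, 30, 1),
                   'weight': np.linspace(200, 2500, 30)})

# Hand-written edges from the question; each label is the bin's right edge.
edges = [3, 5, 7, 9, 11] + list(range(12, 29))
df['bin_aged'] = pd.cut(df['age'],
                        bins=[-np.inf] + edges + [np.inf],
                        labels=edges + [29])

# observed=True keeps only the bins that actually contain rows.
sums = df.groupby('bin_aged', observed=True)['weight'].sum()
print(sums.head())

# True when every bin holds at least 1000 units of weight.
print((sums >= 1000).all())
```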

Is it possible to generate these bins automatically?

There is 1 solution below

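I figured this out. The function below walks through the sorted values of the column, accumulates weight, and closes a bin once the accumulated total reaches the minimum: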

def dynamic_bin(df, column, weight, minimum):
    """
    Bin `column` so that each bin holds at least `minimum` total `weight`.

    Parameters
    ----------
    df : dataframe
    column : name of the column to be binned
    weight : name of the column whose sum dictates the bins
    minimum : minimum weight per bin

    Returns
    -------
    df : the same dataframe with a new "<column>_binned" column added

    """
    # Total weight per distinct value of `column`, in ascending order.
    totals = df.groupby(column)[weight].sum().sort_index()

    bins = [-np.inf]
    labels = []
    running = 0  # weight accumulated since the last bin edge
    for value, total in totals.items():
        running += total
        if running >= minimum:
            # Close the bin at this value and start accumulating again.
            bins.append(value)
            labels.append(value)
            running = 0

    if labels:
        # Replace the last edge with +inf so leftover values that never
        # reach `minimum` on their own fall into the last bin.
        bins.pop()
    else:
        # The whole column never reaches `minimum`: use a single bin.
        labels.append(totals.index[-1])
    bins.append(np.inf)

    str_column = str(column) + "_binned"
    df[str_column] = pd.cut(df[column], bins=bins, labels=labels)

    return df
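
To show how the `minimum` parameter changes the result, here is a compact, self-contained sketch of the same greedy rule. The helper `weight_bin_edges` is a hypothetical name introduced for this example, not part of pandas or of the function above. It returns only the right edges at which a bin would close, which makes the effect of different minimums easy to compare:

```python
import numpy as np
import pandas as pd

def weight_bin_edges(df, column, weight, minimum):
    """Hypothetical helper: values of `column` at which a bin closes."""
    # Total weight per distinct value, in ascending order.
    totals = df.groupby(column)[weight].sum().sort_index()
    edges, running = [], 0.0
    for value, total in totals.items():
        running += total
        if running >= minimum:
            edges.append(value)  # enough weight collected: close the bin
            running = 0.0
    return edges

df = pd.DataFrame({'age': np.arange(0, 30, 1),
                   'weight': np.linspace(200, 2500, 30)})

# Compare the number of bins and the first few edges for each minimum.
for minimum in (500, 1000, 2000):
    edges = weight_bin_edges(df, 'age', 'weight', minimum)
    print(minimum, len(edges), edges[:5])
```

With a minimum of 1000 the edges on the sample data start at 3, 5, 7, 9, 11, in line with the hand-written bins in the question. A larger minimum yields fewer, wider bins.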