In my code, I have created 10 bins (specific ranges of bins are listed below):

  1. 4100000-4155304

  2. 4155304-4210608

  3. 4210608-4321216

  4. 4321216-4542432

  5. 4542432-4984865

  6. 4984865-5327533

  7. 5327533-5670201

  8. 5670201-5746217

  9. 5746217-5873109

  10. 5873109-6000000

    bins = [4100000,4155304,4210608,4321216,4542432,4984865,5327533,5670201,5746217,5873109,6000000]
    bin_indices = np.digitize(bins_array, bins)
    

Is there a way I can do this without having to list all the bin numbers (bins = [bin numbers]), and maybe also without having to use np.digitize? Thank you very much!

There are 2 best solutions below

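If equal-width bins are acceptable, use the numpy.arange function so you don't have to type the edges by hand. Note that the stop value is exclusive (6000000 itself is not generated) and that the bins in the question are not all the same width, so this does not reproduce them exactly: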

    import numpy as np

    # Edges from 4100000 up to (but not including) 6000000, in steps of 55304
    bins = np.arange(4100000, 6000000, 55304)
    bins

Output

    array([4100000, 4155304, 4210608, 4265912, 4321216, 4376520, 4431824,
           4487128, 4542432, 4597736, 4653040, 4708344, 4763648, 4818952,
           4874256, 4929560, 4984864, 5040168, 5095472, 5150776, 5206080,
           5261384, 5316688, 5371992, 5427296, 5482600, 5537904, 5593208,
           5648512, 5703816, 5759120, 5814424, 5869728, 5925032, 5980336])

Cheers
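
On the second part of the question (avoiding np.digitize), one option that may be worth trying is np.searchsorted, which with side='right' follows the same convention as np.digitize's default for increasing edges. The sketch below is only an illustration: the `values` array is made up, and the np.linspace call assumes that 10 equal-width bins would be acceptable, which is not what the question's edges describe.

```python
import numpy as np

# Edges copied from the question (they are not equally spaced).
bins = np.array([4100000, 4155304, 4210608, 4321216, 4542432, 4984865,
                 5327533, 5670201, 5746217, 5873109, 6000000])

# Hypothetical sample values, used only for illustration.
values = np.array([4100000, 4155303, 4155304, 5000000, 5999999])

# side='right' returns i such that bins[i-1] <= x < bins[i],
# the same convention as np.digitize(values, bins) with right=False.
idx = np.searchsorted(bins, values, side='right')
print(idx)

# Equal-width alternative (an assumption, not the question's bins):
# 10 bins need 11 edges, and linspace includes the upper endpoint.
even_edges = np.linspace(4100000, 6000000, num=11)
print(even_edges)
```

Unlike np.arange, np.linspace takes the number of edges rather than a step, so the last edge lands exactly on 6000000.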

I can't find the original author of the Stack Overflow post I adapted this from, but here is a quick Pandas sketch that I threw together as an idea to try. The data frame is just filled with random integers from NumPy to fake data in the ranges you are looking for.

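    import pandas as pd
    import numpy as np

    # Labels for the 10 ranges, written as 'lower_upper' with both ends inclusive
    cats = ['4100000_4155303',
            '4155304_4210607',
            '4210608_4321215',
            '4321216_4542431',
            '4542432_4984864',
            '4984865_5327532',
            '5327533_5670200',
            '5670201_5746216',
            '5746217_5873108',
            '5873109_6000000']

    # Bin edges for pd.cut. By default each bin is (left, right], so every edge
    # is the inclusive upper end of a range, and the first edge sits one below
    # the lowest value. 11 edges give 10 bins, one per label.
    bins = [4099999,
            4155303,
            4210607,
            4321215,
            4542431,
            4984864,
            5327532,
            5670200,
            5746216,
            5873108,
            6000000]


    def binn(df):
        # For every row, find the category that column A falls into, then
        # spread the categories into columns (one per bin, zero-filled)
        df = (df.groupby([df.index, pd.cut(df['A'], bins, labels=cats)], observed=False)
                    .size()
                    .unstack(fill_value=0)
                    .reindex(columns=cats, fill_value=0))
        return df


    # Fake data: 1000 random integers. The lower bound is 4155304, so the
    # first bin stays empty.
    rng = np.random.default_rng()
    df = pd.DataFrame(rng.integers(4155304, 6000000, size=(1000, 1)), columns=list('A'))

    dfBinned = binn(df)

    print('All data binned in column A of the df')
    print(dfBinned.sum(axis=0))

This prints (the counts vary from run to run because the data is random):

    All data binned in column A of the df
    A
    4100000_4155303      0
    4155304_4210607     35
    4210608_4321215     42
    4321216_4542431    130
    4542432_4984864    239
    4984865_5327532    174
    5327533_5670200    205
    5670201_5746216     37
    5746217_5873108     63
    5873109_6000000     75
    dtype: int64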
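
If typing the label list by hand is the annoying part, the labels could probably be derived from the edges instead. The following is a minimal sketch under a few assumptions: the names `edges`, `labels`, `values` and `counts` are made up for illustration, the random data is fake, and right=False is chosen so that each bin is [lower, upper), like np.digitize's default. Because of that choice the last label ends at 5999999 rather than 6000000.

```python
import numpy as np
import pandas as pd

# Edges copied from the question.
edges = [4100000, 4155304, 4210608, 4321216, 4542432, 4984865,
         5327533, 5670201, 5746217, 5873109, 6000000]

# Build 'lower_upper' labels from consecutive edge pairs.
# The upper end of each label is inclusive, hence hi - 1.
labels = [f"{lo}_{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]

# Fake data for illustration only (seeded so reruns give the same values).
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(4100000, 6000000, size=1000))

# right=False makes each bin [lower, upper).
binned = pd.cut(values, bins=edges, right=False, labels=labels)

# sort=False keeps the counts in bin order instead of sorting by frequency.
counts = binned.value_counts(sort=False)
print(counts)
```

This gives one count per bin without the groupby/unstack step, which may be enough if only the totals are needed.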