Discretization : converting continuous values into a certain number of categories

994 Views Asked by At
1   Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.

2   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

3   Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.

4   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

My codes are below

df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))

the out put is below:

               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1
               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1

but -1925 is wrong...

The right answer should be like this. enter image description here

How can I do...

1

There are 1 best solutions below

0
On BEST ANSWER

Maybe a typo on line 1: df["Usage_Per_Year "]? There is a space at the end of the column name.

pd.cut bins values into equal size. That's why all of your bins have same size. It seems that you should compute the min and max of each group after binning.

Also, to bin value into equal frequency, you should use pd.qcut.


Example input:

import numpy as np
import pandas as pd

rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})

# 1
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)

# 2
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

# 3
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)

# 4
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))

Example output:

               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              878     31  20905
Medium                           107  20955  41196
High                              15  41991  62668
               Miles_Driven_Per_Year              
                               count    min    max
Usage_Per_Year                                    
Low                              334     31   4378
Medium                           333   4449  11424
High                             333  11442  62668