Make Bins in pyspark with range of values and specific values


I want to create bins on a PySpark dataframe column using both ranges of values and specific single values.

The post How to bin in PySpark? explains how to make the bins, but only for ranges of values (0 to 6, 6 to 18, and so on). I would also like a bin for a single specific value. For example, I want bins for 0 to 5, 5 to 9, exactly 10, 11 to 20, and 21 and above:

+-----+-----+
| Age | Bin |
+-----+-----+
|  4  |  1  |
| 10  |  3  |
|  6  |  2  |
| 40  |  5  |
|  2  |  1  |
|  8  |  2  |
+-----+-----+

I tried

splits = [ 0, 5, 9, 10, 10, 11, float('Inf') ]

But it gives an error, since Bucketizer requires the splits to be strictly increasing.

I worked around it by adding a small tolerance so that one bin covers just the value 10, but I am wondering if there is a better way:

tol= 1.0e-10
splits = [ 0, 5, 9, 10, 10+tol, 11, float('Inf') ]
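For reference, the effect of the tolerance trick can be checked without Spark by emulating Bucketizer's half-open intervals in plain Python with bisect (a sketch of the bucketing rule only, not Spark code; Bucketizer itself rejects splits that are not strictly increasing):

```python
from bisect import bisect_right

tol = 1.0e-10
# Strictly increasing splits; bucket i covers [splits[i], splits[i+1]),
# so the interval [10, 10 + tol) holds exactly the value 10 for integer ages.
splits = [0, 5, 9, 10, 10 + tol, 11, float('inf')]

def bucket(x):
    # Mirrors Bucketizer's assignment: index of the interval containing x.
    # Note Bucketizer buckets are 0-based, while the question's bins start at 1.
    return bisect_right(splits, x) - 1

print(bucket(4))   # 0 -> range 0 to 5
print(bucket(10))  # 3 -> the singleton bin for 10
print(bucket(40))  # 5 -> 11 and above
```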


You can manually specify a CASE WHEN column for binning:

from functools import reduce
from pyspark.sql import functions as F

splits = [ 0, 5, 9, 10, 11 ]
splits = list(enumerate(splits))
# >>> splits
# [(0, 0), (1, 5), (2, 9), (3, 10), (4, 11)]

bins = reduce(
    lambda c, i: c.when(F.col('Age') <= i[1], i[0]),
    splits,
    F.when(F.col('Age') < splits[0][0], None),
).otherwise(splits[-1][0] + 1).alias('bins')
# >>> bins
# Column<b'CASE WHEN (Age < 0) THEN NULL WHEN (Age <= 0) THEN 0 WHEN (Age <= 5) THEN 1 WHEN (Age <= 9) THEN 2 WHEN (Age <= 10) THEN 3 WHEN (Age <= 11) THEN 4 ELSE 5 END AS `bins`'>

df = df.select('age', bins)

df.show()
+---+----+
|age|bins|
+---+----+
|  4|   1|
| 10|   3|
|  6|   2|
| 40|   5|
|  2|   1|
|  8|   2|
+---+----+
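The boundary logic of the generated CASE WHEN chain can be sanity-checked in plain Python (a sketch of the equivalent rule over the same splits; not Spark code):

```python
splits = [0, 5, 9, 10, 11]

def bin_of(age):
    # First split whose upper bound covers the age; ages above the last
    # split fall into the final "else" bin, just like the CASE WHEN chain.
    for idx, bound in enumerate(splits):
        if age <= bound:
            return idx
    return len(splits)

ages = [4, 10, 6, 40, 2, 8]
print([bin_of(a) for a in ages])  # [1, 3, 2, 5, 1, 2] - matches df.show()
```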