Dividing a time-format column into a specified number of discrete intervals in Hive


I have a column named ViolationTime in my Hive table. It contains times in 24-hour HHmm format, for example 1424. The table contains 10 million rows. I want to divide the column into 6 discrete groups to perform operations on them.

I tried using ntile, but it splits the rows into groups by their position in ascending or descending order rather than by fixed ranges of values. I'd like this column to be divided into discrete intervals.


There is 1 solution below


In Hive 3.0 and newer, the width_bucket() function does that:

Return type: INT
Name (signature): width_bucket(NUMERIC expr, NUMERIC min_value, NUMERIC max_value, INT num_buckets)
Description: Returns an integer between 0 and num_buckets+1 by mapping expr into the ith equally sized bucket. Buckets are made by dividing [min_value, max_value] into equally sized regions. If expr < min_value, return 0, if expr > max_value return num_buckets+1.

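You will probably need to convert your HHmm values to a linear numeric scale first (e.g. the number of minutes or seconds since midnight) for the buckets to cover equal spans of time. Raw HHmm values are not evenly spaced: 1359 is followed by 1400, so the values 1360 to 1399 never occur.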
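
A minimal sketch of how this could look, not a tested solution. It assumes ViolationTime always holds a clean 4-digit HHmm value (stored as a string or an integer), and the table name parking_violations is hypothetical. The last column shows the same bucketing done with plain arithmetic, which may be useful on Hive versions that lack width_bucket().

```sql
-- Sketch only: parking_violations is a hypothetical table name, and
-- ViolationTime is assumed to be a valid HHmm value such as 1424.
WITH t AS (
  SELECT
    ViolationTime,
    -- Minutes since midnight: HH * 60 + mm
    (CAST(ViolationTime AS INT) DIV 100) * 60
      + (CAST(ViolationTime AS INT) % 100) AS minutes_since_midnight
  FROM parking_violations
)
SELECT
  ViolationTime,
  minutes_since_midnight,
  -- Six equal buckets of 240 minutes each over the 1440 minutes of a day
  width_bucket(minutes_since_midnight, 0, 1440, 6) AS time_bucket,
  -- Same idea with plain arithmetic, for versions without width_bucket()
  CAST(floor(minutes_since_midnight / 240) AS INT) + 1 AS time_bucket_manual
FROM t;
```

With this conversion, 1424 becomes 864 minutes since midnight, and the manual expression gives floor(864 / 240) + 1 = 4, i.e. the 12:00 to 15:59 interval; width_bucket() should agree for any valid time of day. Null or malformed values would need to be filtered out or handled separately.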