cuDF - groupby UDF to support datetime

I have a cuDF dataframe with the following columns:

columns = ["col1", "col2", "dt"]

The dt column has dtype datetime64[ns].

I would like to write a UDF that is applied to each group in this dataframe and returns the max of dt for that group. Here is what I am trying, but it seems that numba doesn't support datetime64[ns] values in UDFs.

import numpy
from numba import cuda

def f1(dt, out):
    l = len(dt)
    maxvalue = dt[0]
    # each thread strides over the rows of the group
    for i in range(cuda.threadIdx.x, l, cuda.blockDim.x):
        if dt[i] > maxvalue:
            maxvalue = dt[i]
    out[:0] = maxvalue

gdf = df.groupby(["col1", "col2"], method="cudf")
df = gdf.apply_grouped(f1, incols={"dt": "dt"}, outcols=dict(out=numpy.datetime64))

Here is the error I get:

This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7effda063510>)
[2] During: typing of call at <string> (10)

I have similar functions that work fine with integers and floats. Does this mean that numba doesn't support datetimes?

There is 1 solution below

apply_grouped won't give you what I think you're after, which is the max of dt per group. Instead, use agg with max on dt; cuDF's groupby does the rest. To get your values in datetime64[ms], use astype() and save the result back to the dataframe (very fast). See my example:

import cudf
a = cudf.DataFrame({"col1": [1, 1, 1, 2, 2, 2], "col2": [1, 2, 1, 1, 2, 1], "dt": [10000000, 2000000, 3000000, 100000, 2000000, 40000000]}) 
a['dt'] = a['dt'].astype('datetime64[ns]')
print(a)
a['dt'] = a['dt'].astype('datetime64[ms]')
print(a)
gdf = a.groupby(["col1", "col2"]).agg({'dt':'max'})
print(gdf.head())

The integer dt values are interpreted as nanoseconds since Jan 1st, 1970, so they represent 0.1 to 40 milliseconds after the epoch (the 0.1 ms value truncates to 0.000 at millisecond resolution). The three print calls give:

   col1  col2                         dt
0     1     1 1970-01-01 00:00:00.010000
1     1     2 1970-01-01 00:00:00.002000
2     1     1 1970-01-01 00:00:00.003000
3     2     1 1970-01-01 00:00:00.000100
4     2     2 1970-01-01 00:00:00.002000
5     2     1 1970-01-01 00:00:00.040000

   col1  col2                      dt
0     1     1 1970-01-01 00:00:00.010
1     1     2 1970-01-01 00:00:00.002
2     1     1 1970-01-01 00:00:00.003
3     2     1 1970-01-01 00:00:00.000
4     2     2 1970-01-01 00:00:00.002
5     2     1 1970-01-01 00:00:00.040

                               dt
col1 col2                        
1    1    1970-01-01 00:00:00.010
     2    1970-01-01 00:00:00.002
2    1    1970-01-01 00:00:00.040
     2    1970-01-01 00:00:00.002
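
If you still need a custom kernel rather than a built-in aggregation, one possible workaround is to hand the UDF integers instead of datetimes. A datetime64[ns] value is stored as a 64-bit count of nanoseconds since the epoch, so comparing the integers orders the timestamps the same way. The sketch below shows that round trip with plain NumPy on the CPU, which is an assumption made so that it runs without a GPU; group_max_as_int is a hypothetical helper, not part of cuDF or numba. With cuDF, the analogous step would presumably be casting the column with astype('int64') before the UDF and back to datetime64[ns] afterwards, which I have not tested here.

```python
import numpy as np

# Same sample values as the example above, as nanoseconds since the epoch.
col1 = np.array([1, 1, 1, 2, 2, 2])
col2 = np.array([1, 2, 1, 1, 2, 1])
dt = np.array(
    [10000000, 2000000, 3000000, 100000, 2000000, 40000000], dtype="int64"
).astype("datetime64[ns]")

# Convert the timestamps to int64 nanosecond counts; ordering is preserved.
dt_int = dt.astype("int64")


def group_max_as_int(keys1, keys2, values):
    """Hypothetical CPU helper: per-group max over plain integer values."""
    best = {}
    for k1, k2, v in zip(keys1.tolist(), keys2.tolist(), values.tolist()):
        key = (k1, k2)
        if key not in best or v > best[key]:
            best[key] = v
    return best


max_int = group_max_as_int(col1, col2, dt_int)

# Cast the integer results back to datetime64[ns].
max_dt = {key: np.datetime64(v, "ns") for key, v in max_int.items()}

for key in sorted(max_dt):
    print(key, max_dt[key])
```

The per-group maxima match the agg output above. Note that this sketch ignores missing values: NaT has its own integer representation, so it would need separate handling before the cast.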