Pandas DataFrame: split the time between two timestamps into hourly bins

Say I have data in this format in a DataFrame:

id      sta                   end                   dur
40433   2020-01-08 05:06:01   2020-01-08 05:08:14   133
40433   2020-09-22 12:01:26   2020-09-22 12:31:34   1808
40433   2020-09-22 12:05:00   2020-09-22 13:05:00   3600

In the same DataFrame or in a new one, I want records split by hour, like this:

id      sta                  end                   h1  dur
40433   2020-01-08 05:06:01  2020-01-08 05:08:14   05  133
40433   2020-09-22 12:01:26  2020-09-22 12:31:34   12  1808
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   12  3300
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   13  300

dur is in seconds.

I then want to group by id, then by day (extracted from sta), then by hour (h1, h2, etc.), summing dur for each hour for that id.
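
As a rough sketch of that last step only, assuming the rows have already been expanded as in the table above, the aggregation might look like this in pandas. The names `expanded` and `per_hour` are illustrative and not part of the original data:

```python
import pandas as pd

# Expanded rows copied from the example table above
# (one row per id, interval and clock hour; dur in seconds).
expanded = pd.DataFrame({
    "id": [40433, 40433, 40433, 40433],
    "sta": pd.to_datetime([
        "2020-01-08 05:06:01",
        "2020-09-22 12:01:26",
        "2020-09-22 12:05:00",
        "2020-09-22 12:05:00",
    ]),
    "h1": [5, 12, 12, 13],
    "dur": [133, 1808, 3300, 300],
})

# Day is taken from sta, as described above.
expanded["day"] = expanded["sta"].dt.date

# Sum the seconds per id, day and hour.
per_hour = expanded.groupby(["id", "day", "h1"], as_index=False)["dur"].sum()
print(per_hour)
```

With the sample rows, hour 12 on 2020-09-22 sums to 5108 seconds, more than 3600, because the two sample intervals overlap.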

There is 1 best solution below

Revised answer based on your comments. For a quicker turnaround, I switched to array math with a few conversions after trying some other methods. There may be a more efficient way, and I'm not sure how it performs at scale, but it works. Note that it uses cudf and cupy, and that the per-hour values it produces are in minutes. One caveat: if a duration runs to 24 hours or more, every hour column should be 60 minutes. The code below does not fill those in, so I left that case for you to handle as you see fit:

import cudf
import cupy as cp

#Caveat: durations of 24 hours or more are not handled here; every hour column should then be 60 minutes.

sta = ['2020-01-08 05:06:01', '2020-09-22 12:01:26', '2020-09-22 12:05:00', '2020-09-22 01:15:00', '2020-09-22 21:05:00']
end = ['2020-01-08 05:08:14', '2020-09-22 12:31:34', '2020-09-22 13:05:00', '2020-09-22 08:05:00', '2020-09-23 01:05:00']

#put it in a dataframe
df = cudf.DataFrame({'sta': sta, 'end':end})
print(df.head())

#the columns hold strings, so convert them to datetime
df['sta'] = df['sta'].astype('datetime64[s]')
df['end'] = df['end'].astype('datetime64[s]')

#duration in seconds
df['dur'] = (df['end'] - df['sta']).astype('int64')

#build a helper df of int64 columns only, so it can be converted to a cupy array (df keeps its datetime columns)
df2 = cudf.DataFrame()
df2['dur'] = (df['end'] - df['sta']).astype('int64')
df2['min_sta'] =df['sta'].dt.minute.astype('int64')
df2['min_end']= df['end'].dt.minute.astype('int64')
df2['h_sta']= df['sta'].dt.hour.astype('int64')
df2['h_end']= df['end'].dt.hour.astype('int64')
df2['day']=df['sta'].dt.day.astype('int64')
print(df2)

#convert df2's values to a cupy array (you can use numpy if on pandas)
#columns of a: 0=dur, 1=min_sta, 2=min_end, 3=h_sta, 4=h_end, 5=day
a = cp.fromDlpack(df2.to_dlpack())
print(a)

#create a new temp cupy array b to hold the minutes spent in each hour. This algo will also work with numpy by using numpy instead of cupy
b = cp.zeros((len(a),24))
for j in range(0,len(a)):
    hours = int((a[j][0]/3600)+(a[j][1]/60))
    if(hours==0): # within same hour
        b[j][a[j][3]] = int(a[j][0]/60)
    elif(hours==1): #you could probably delete this condition.
        b[j][a[j][3]] = 60-a[j][1]
        b[j][a[j][4]] = a[j][2]
    else:
        b[j][a[j][3]] = 60-a[j][1]
        if(hours<24): #all array elements will be all 60 minutes if duration is over 24 hours
            if(a[j][3]+hours<24):
                b[j][a[j][3]+1:a[j][3]+hours]=60
                b[j][a[j][4]] = a[j][2]
            else:
                b[j][a[j][3]+1:24]=60
                b[j][0:(a[j][3]+1+hours)%24]=60
                b[j][a[j][4]] = a[j][2]
# bring cupy array b back to a df. 
reshaped_arr = cp.asfortranarray(b)
cpdf = cudf.from_dlpack(reshaped_arr.toDlpack())
print(cpdf.head())

#concat the original and cupy df
df = cudf.concat([df, cpdf], axis=1)
print(df.head())
#you can rename the columns with "h" as you wish
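
For readers on plain pandas, here is a minimal sketch of a different approach from the array math above: it walks each interval hour by hour and emits one row per clock hour, with dur in seconds as in the question. The function name `split_into_hour_bins` and the extra `day` column are assumptions made for illustration, and the Python-level loop is not tuned for large frames:

```python
import pandas as pd

def split_into_hour_bins(df):
    """Return one row per (interval, clock hour) with the seconds spent in that hour."""
    rows = []
    for rec in df.itertuples(index=False):
        cursor = rec.sta
        while cursor < rec.end:
            # Start of the next clock hour after `cursor`.
            hour_start = cursor.replace(minute=0, second=0, microsecond=0, nanosecond=0)
            next_hour = hour_start + pd.Timedelta(hours=1)
            # The piece ends at the hour boundary or at the interval end, whichever is first.
            piece_end = min(next_hour, rec.end)
            rows.append({
                "id": rec.id,
                "sta": rec.sta,
                "end": rec.end,
                "day": cursor.normalize(),  # calendar day of this piece
                "h1": cursor.hour,          # clock hour of this piece
                "dur": int((piece_end - cursor).total_seconds()),
            })
            cursor = piece_end
    return pd.DataFrame(rows, columns=["id", "sta", "end", "day", "h1", "dur"])

# Sample data from the question.
df = pd.DataFrame({
    "id": [40433, 40433, 40433],
    "sta": pd.to_datetime(["2020-01-08 05:06:01", "2020-09-22 12:01:26", "2020-09-22 12:05:00"]),
    "end": pd.to_datetime(["2020-01-08 05:08:14", "2020-09-22 12:31:34", "2020-09-22 13:05:00"]),
})
out = split_into_hour_bins(df)
print(out[["id", "h1", "dur"]])
```

Because the loop advances one hour boundary at a time, intervals that cross midnight or last longer than 24 hours simply produce more rows, one per clock hour touched.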