How to build "product" datetime index from a list of dates and a list of times in pandas?

86 Views Asked by At

I have

  • a list of dates (size=7713): [date(1994,5,25), ..., date(2023,12,19)]
  • a list of times (basically 5 minute interval of a day, size=288): [time(0, 0), ..., time(23, 55)]

I want to build a DatetimeIndex, where its full "product" combine, i.e., combine all time with each date. In the end, the size of the index is 7713 * 288.

My current solution is

datetimes = [pd.Timestamp.combine(d, t) for d in dates for t in times]
index = pd.DatetimeIndex(datetimes)

However it is super slow

  • combining takes around 9 seconds
  • building index takes around 9 seconds again

I tried directly give datetimes to pd.DataFrame(index=datetimes, ....), it takes 6 seconds (better than 9 seconds if I build index beforehand).

Do we have much faster way for this?

1

There are 1 best solutions below

2
On BEST ANSWER

Your code is slow because it runs pd.Timestamp.combine in a double for loop.

Instead, use pd.merge with type cross to build the "product" first. Then do either datetime.combine or a string summation as vector operations to create a joint set. Is either one of these two faster?

from datetime import date, time, datetime
import pandas as pd

a_df = pd.DataFrame([date(1994,5,25), date(1999,6,14), date(2005,12,8)])
b_df = pd.DataFrame([time(0, 0), time(16, 24), time(23, 55)])

# step 1: build product with "merge"
tmp = pd.merge(a_df, b_df, how='cross')

# step 2a: collapse both cols with "datetime.combine"
my_index_1 = tmp.apply(lambda r : datetime.combine(r['0_x'],r['0_y']),1)

# step 2b: alternatively, collapse both cols via string conversion
my_index_2 = pd.to_datetime(tmp['0_x'].astype(str) + ' ' + tmp['0_y'].astype(str))

This gives you the desired result:

print(my_index)
0   1994-05-25 00:00:00
1   1994-05-25 16:24:00
2   1994-05-25 23:55:00
3   1999-06-14 00:00:00
4   1999-06-14 16:24:00
5   1999-06-14 23:55:00
6   2005-12-08 00:00:00
7   2005-12-08 16:24:00
8   2005-12-08 23:55:00
dtype: datetime64[ns]

Or, keep the date and time separate as a multiindex (depending on what analysis is coming down the road, maybe pivoting directly on time of day is good enough). Should spare you the effort/time needed to combine:

tmp.set_index(['0_x', '0_y']).index
MultiIndex([(1994-05-25, 00:00:00),
            (1994-05-25, 16:24:00),
            (1994-05-25, 23:55:00),
            (1999-06-14, 00:00:00),
            (1999-06-14, 16:24:00),
            (1999-06-14, 23:55:00),
            (2005-12-08, 00:00:00),
            (2005-12-08, 16:24:00),
            (2005-12-08, 23:55:00)],
           names=['0_x', '0_y'])