I have a df that looks similar to this (shortened version, with less rows):

    Time (EDT)          Open    High    Low     Close
0   02.01.2006 19:00:00 0.85224 0.85498 0.85224 0.85498
1   02.01.2006 20:00:00 0.85498 0.85577 0.85423 0.85481
2   02.01.2006 21:00:00 0.85481 0.85646 0.85434 0.85646
3   02.01.2006 22:00:00 0.85646 0.85705 0.85623 0.85651
4   02.01.2006 23:00:00 0.85643 0.85691 0.85505 0.85653
5   03.01.2006 00:00:00 0.85653 0.8569  0.85601 0.85626
6   03.01.2006 01:00:00 0.85626 0.85653 0.85524 0.8557
7   03.01.2006 02:00:00 0.85558 0.85597 0.85486 0.85597
8   03.01.2006 03:00:00 0.85597 0.85616 0.85397 0.8548
9   03.01.2006 04:00:00 0.85469 0.85495 0.8529  0.85328
10  03.01.2006 05:00:00 0.85316 0.85429 0.85222 0.85401
11  03.01.2006 06:00:00 0.85401 0.8552  0.853   0.8552
12  03.01.2006 07:00:00 0.8552  0.8555  0.85319 0.85463
13  03.01.2006 08:00:00 0.85477 0.85834 0.8545  0.85788
14  03.01.2006 09:00:00 0.85788 0.85838 0.85341 0.85416
15  03.01.2006 10:00:00 0.8542  0.8542  0.85006 0.85111
16  03.01.2006 11:00:00 0.85115 0.85411 0.85    0.85345
17  03.01.2006 12:00:00 0.85337 0.85432 0.8526  0.85413
18  03.01.2006 13:00:00 0.85413 0.85521 0.85363 0.85363
19  03.01.2006 14:00:00 0.85325 0.8561  0.85305 0.85606
20  03.01.2006 15:00:00 0.8561  0.85675 0.85578 0.85599

I need to convert the date string to datetime, then set date column as index, and resample. When I use method 1, I can't resample properly, the data how it resamples is wrong and it creates extra future dates. Let say my last date is 2018-11, I will see 2018-12 something like that.

method 1:

df['Time (EDT)'] = pd.to_datetime(df['Time (EDT)'])  <---- this takes long also, because theres 90000 rows
df.set_index('Time (EDT)', inplace=True)

ohlc_dict = {'Open':'first','High':'max', 'Low':'min','Close'}
df=df.resample'4H', base=17, closed='left', label='left').agg(ohlc_dict)

result:

    Time (EDT)      Open    High    Low     Close
    1/1/2006 21:00  0.86332 0.86332 0.86268 0.86321
    1/2/2006 1:00   0.86321 0.86438 0.86111 0.86164
    1/2/2006 5:00   0.86164 0.86222 0.8585  0.86134
    1/2/2006 9:00   0.86149 0.86297 0.85695 0.85793
    1/2/2006 13:00  0.85801 0.85947 0.85759 0.8591
    1/2/2006 17:00  0.8591  0.86034 0.85757 0.85825
    1/2/2006 21:00  0.85825 0.85969 0.84377 0.84412
    1/3/2006 1:00   0.84445 0.8468  0.84286 0.84642
    1/3/2006 5:00   0.84659 0.8488  0.84494 0.84872
    1/3/2006 9:00   0.84829 0.84915 0.84271 0.84416
    1/3/2006 13:00  0.84372 0.8453  0.84346 0.84423
    1/3/2006 17:00  0.84426 0.84693 0.84426 0.84516
    1/3/2006 21:00  0.84523 0.8458  0.84442 0.84579

When I use method 2. It resamples properly.

method 2:

def to_datetime_obj(date_string):
     datetime_obj = datetime.strptime(date_string[:], '%d.%m.%Y %H:%M:%S')
     return datetime_obj

datetime_objs = None
date_list = df['Time (EDT)'].tolist()
datetime_objs=list(map(to_datetime_obj, date_list)) <--- this is faster also
df.iloc[:,:1] = datetime_objs
df.set_index('Time (EDT)', inplace=True)

ohlc_dict = {'Open':'first','High':'max', 'Low':'min','Close'}
df=df.resample'4H', base=17, closed='left', label='left').agg(ohlc_dict)

result:

Time (EDT)      Open    High    Low     Close
1/2/2006 17:00  0.85224 0.85577 0.85224 0.85481
1/2/2006 21:00  0.85481 0.85705 0.85434 0.85626
1/3/2006 1:00   0.85626 0.85653 0.8529  0.85328
1/3/2006 5:00   0.85316 0.85834 0.85222 0.85788
1/3/2006 9:00   0.85788 0.85838 0.85    0.85413
1/3/2006 13:00  0.85413 0.85675 0.85305 0.85525
1/3/2006 17:00  0.85525 0.85842 0.85502 0.85783
1/3/2006 21:00  0.85783 0.85898 0.85736 0.85774
1/4/2006 1:00   0.85774 0.85825 0.8558  0.85595
1/4/2006 5:00   0.85595 0.85867 0.85577 0.85839
1/4/2006 9:00   0.85847 0.85981 0.85586 0.8578
1/4/2006 13:00  0.85773 0.85886 0.85597 0.85653
1/4/2006 17:00  0.85653 0.85892 0.85642 0.8584
1/4/2006 21:00  0.8584  0.85863 0.85658 0.85715
1/5/2006 1:00   0.85715 0.8588  0.85641 0.85791
1/5/2006 5:00   0.85803 0.86169 0.85673 0.86065
  1. The df.index of method 1 and 2 are the same visually before resampling.
  2. They are both pandas.core.indexes.datetimes.DatetimeIndex
  3. But when I compare them, they are actually different method1_df.index != method2_df.index

Why is that? How to fix? Thanks.

2

There are 2 best solutions below

5
On

It's surprising that a vectorized method (pd.to_datetime), written in Cython is slower than a pure Python method (datetime.strptime).

You can specify the format to pd.to_datetime whicch speeds it up a lot:

pd.to_datetime(df['Time (EDT)'], format='%d.%m.%Y %H:%M:%S')

For your second problem, I think it may have something to do with the order of day and month in your string data. Have you verified that the two methods actually give you the same datetimes?

s1 = pd.to_datetime(df['Time (EDT)'])
s2 = pd.Series(map(to_datetime_obj, date_list))
(s1 == s2).all()
1
On

For me datetime.strptime was 3 times faster than pd.to_datetime for 2 operations per row on a 880,000+ rows DataFrame.