I have a df that looks similar to this (shortened version, with less rows):
Time (EDT) Open High Low Close
0 02.01.2006 19:00:00 0.85224 0.85498 0.85224 0.85498
1 02.01.2006 20:00:00 0.85498 0.85577 0.85423 0.85481
2 02.01.2006 21:00:00 0.85481 0.85646 0.85434 0.85646
3 02.01.2006 22:00:00 0.85646 0.85705 0.85623 0.85651
4 02.01.2006 23:00:00 0.85643 0.85691 0.85505 0.85653
5 03.01.2006 00:00:00 0.85653 0.8569 0.85601 0.85626
6 03.01.2006 01:00:00 0.85626 0.85653 0.85524 0.8557
7 03.01.2006 02:00:00 0.85558 0.85597 0.85486 0.85597
8 03.01.2006 03:00:00 0.85597 0.85616 0.85397 0.8548
9 03.01.2006 04:00:00 0.85469 0.85495 0.8529 0.85328
10 03.01.2006 05:00:00 0.85316 0.85429 0.85222 0.85401
11 03.01.2006 06:00:00 0.85401 0.8552 0.853 0.8552
12 03.01.2006 07:00:00 0.8552 0.8555 0.85319 0.85463
13 03.01.2006 08:00:00 0.85477 0.85834 0.8545 0.85788
14 03.01.2006 09:00:00 0.85788 0.85838 0.85341 0.85416
15 03.01.2006 10:00:00 0.8542 0.8542 0.85006 0.85111
16 03.01.2006 11:00:00 0.85115 0.85411 0.85 0.85345
17 03.01.2006 12:00:00 0.85337 0.85432 0.8526 0.85413
18 03.01.2006 13:00:00 0.85413 0.85521 0.85363 0.85363
19 03.01.2006 14:00:00 0.85325 0.8561 0.85305 0.85606
20 03.01.2006 15:00:00 0.8561 0.85675 0.85578 0.85599
I need to convert the date string to datetime, then set date column as index, and resample. When I use method 1, I can't resample properly, the data how it resamples is wrong and it creates extra future dates. Let say my last date is 2018-11, I will see 2018-12 something like that.
method 1:
df['Time (EDT)'] = pd.to_datetime(df['Time (EDT)']) <---- this takes long also, because theres 90000 rows
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open':'first','High':'max', 'Low':'min','Close'}
df=df.resample'4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/1/2006 21:00 0.86332 0.86332 0.86268 0.86321
1/2/2006 1:00 0.86321 0.86438 0.86111 0.86164
1/2/2006 5:00 0.86164 0.86222 0.8585 0.86134
1/2/2006 9:00 0.86149 0.86297 0.85695 0.85793
1/2/2006 13:00 0.85801 0.85947 0.85759 0.8591
1/2/2006 17:00 0.8591 0.86034 0.85757 0.85825
1/2/2006 21:00 0.85825 0.85969 0.84377 0.84412
1/3/2006 1:00 0.84445 0.8468 0.84286 0.84642
1/3/2006 5:00 0.84659 0.8488 0.84494 0.84872
1/3/2006 9:00 0.84829 0.84915 0.84271 0.84416
1/3/2006 13:00 0.84372 0.8453 0.84346 0.84423
1/3/2006 17:00 0.84426 0.84693 0.84426 0.84516
1/3/2006 21:00 0.84523 0.8458 0.84442 0.84579
When I use method 2. It resamples properly.
method 2:
def to_datetime_obj(date_string):
datetime_obj = datetime.strptime(date_string[:], '%d.%m.%Y %H:%M:%S')
return datetime_obj
datetime_objs = None
date_list = df['Time (EDT)'].tolist()
datetime_objs=list(map(to_datetime_obj, date_list)) <--- this is faster also
df.iloc[:,:1] = datetime_objs
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open':'first','High':'max', 'Low':'min','Close'}
df=df.resample'4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/2/2006 17:00 0.85224 0.85577 0.85224 0.85481
1/2/2006 21:00 0.85481 0.85705 0.85434 0.85626
1/3/2006 1:00 0.85626 0.85653 0.8529 0.85328
1/3/2006 5:00 0.85316 0.85834 0.85222 0.85788
1/3/2006 9:00 0.85788 0.85838 0.85 0.85413
1/3/2006 13:00 0.85413 0.85675 0.85305 0.85525
1/3/2006 17:00 0.85525 0.85842 0.85502 0.85783
1/3/2006 21:00 0.85783 0.85898 0.85736 0.85774
1/4/2006 1:00 0.85774 0.85825 0.8558 0.85595
1/4/2006 5:00 0.85595 0.85867 0.85577 0.85839
1/4/2006 9:00 0.85847 0.85981 0.85586 0.8578
1/4/2006 13:00 0.85773 0.85886 0.85597 0.85653
1/4/2006 17:00 0.85653 0.85892 0.85642 0.8584
1/4/2006 21:00 0.8584 0.85863 0.85658 0.85715
1/5/2006 1:00 0.85715 0.8588 0.85641 0.85791
1/5/2006 5:00 0.85803 0.86169 0.85673 0.86065
- The df.index of method 1 and 2 are the same visually before resampling.
- They are both
pandas.core.indexes.datetimes.DatetimeIndex
- But when I compare them, they are actually different
method1_df.index != method2_df.index
Why is that? How to fix? Thanks.
It's surprising that a vectorized method (
pd.to_datetime
), written in Cython is slower than a pure Python method (datetime.strptime
).You can specify the format to
pd.to_datetime
whicch speeds it up a lot:For your second problem, I think it may have something to do with the order of day and month in your string data. Have you verified that the two methods actually give you the same datetimes?