python resample incorrectly

92 Views Asked by At

I have a pandas dataframe like this:

                                    ticker_name  price
timestamp_column
2024-02-01 18:34:58  NIFTY_08-FEB-2024_CE_21000   85.0
2024-02-01 18:34:55  NIFTY_08-FEB-2024_CE_21000   84.0
2024-02-01 18:34:52  NIFTY_08-FEB-2024_CE_21000   83.0
2024-02-01 18:34:49  NIFTY_08-FEB-2024_CE_21000   82.0
2024-02-01 18:34:46  NIFTY_08-FEB-2024_CE_21000   81.0
...                                         ...    ...
2024-02-01 18:30:58  NIFTY_08-FEB-2024_CE_21000    5.0
2024-02-01 18:30:55  NIFTY_08-FEB-2024_CE_21000    4.0
2024-02-01 18:30:52  NIFTY_08-FEB-2024_CE_21000    3.0
2024-02-01 18:30:49  NIFTY_08-FEB-2024_CE_21000    2.0
2024-02-01 18:30:46  NIFTY_08-FEB-2024_CE_21000    1.0

I am trying to resample using the below code:

df_resample = df['price'].resample('10s').ohlc()

And I get:

                     open  high   low  close
timestamp_column
2024-02-01 18:30:40   1.0   2.0   1.0    2.0
2024-02-01 18:30:50   3.0   5.0   3.0    5.0
2024-02-01 18:31:00   6.0   8.0   6.0    8.0
2024-02-01 18:31:10   9.0  12.0   9.0   12.0

....
2024-02-01 18:34:20  73.0  75.0  73.0   75.0
2024-02-01 18:34:30  76.0  78.0  76.0   78.0
2024-02-01 18:34:40  79.0  82.0  79.0   82.0
2024-02-01 18:34:50  83.0  85.0  83.0   85.0

If you look at the data, for example the first row:

2024-02-01 18:30:40   1.0   2.0   1.0    2.0  

This should be the open-high-low-close between slot 18:30:30 to 18:30:40. But instead its 18:30:40 to 18:30:50. You can see this in every row, from the above data you can also see the same applies for the for last row as well.

Am I missing something? Or is this the intended behaviour? It looks wrong to me.

2

There are 2 best solutions below

0
Laurent B. On

Let's suppose we have the following dataframe :

>>> df
      timestamp_column                 ticker_name  price
0  2024-02-01 18:30:58  NIFTY_08-FEB-2024_CE_21000    5.0
1  2024-02-01 18:30:55  NIFTY_08-FEB-2024_CE_21000    4.0
2  2024-02-01 18:30:52  NIFTY_08-FEB-2024_CE_21000    3.0
3  2024-02-01 18:30:49  NIFTY_08-FEB-2024_CE_21000    2.0
4  2024-02-01 18:30:46  NIFTY_08-FEB-2024_CE_21000    1.0

Proposed code

import pandas as pd

data = {
    'timestamp_column': ['2024-02-01 18:30:58', '2024-02-01 18:30:55', '2024-02-01 18:30:52', '2024-02-01 18:30:49', '2024-02-01 18:30:46'],
    'ticker_name': ['NIFTY_08-FEB-2024_CE_21000', 'NIFTY_08-FEB-2024_CE_21000', 'NIFTY_08-FEB-2024_CE_21000', 'NIFTY_08-FEB-2024_CE_21000', 'NIFTY_08-FEB-2024_CE_21000'],
    'price': [5.0, 4.0, 3.0, 2.0, 1.0]
}

df = pd.DataFrame(data)

df['timestamp_column'] = pd.to_datetime(df['timestamp_column']) - pd.DateOffset(seconds=10)

df_resample = df.set_index('timestamp_column')['price'].resample('10s').ohlc()

###                      open  high  low  close
### timestamp_column                           
### 2024-02-01 18:30:30   1.0   2.0  1.0    2.0
### 2024-02-01 18:30:40   3.0   5.0  3.0    5.0
3
Corralien On

There is nothing wrong with your code, this is the expected behavior. It seems your dataframe is sorted in descending order so the first timestamp is "2024-02-01 18:30:46". The resample method returns "2024-02-01 18:30:40" as the first timestamp which is right.

If you want to apply an offset, you can do:

ohlc = df.resample('10s')['price'].ohlc()
ohlc.index -= pd.Timedelta('10s')
print(ohlc)

# Output
                     open  high   low  close
timestamp_column                            
2024-02-01 18:30:30   1.0   2.0   1.0    2.0
2024-02-01 18:30:40   3.0   5.0   3.0    5.0
...