Resampling with date and hour ranges

62 Views Asked by At

I am trying to resample a dataframe and compute aggregated values. During the resampling process, I want to be able to exclude:

  • specific time ranges during each day, format: %H:%M:%S
  • specific days during the period, format %YYYY-MM-DD
def _ohlc(
        trades_df,
        price_col_name,
) -> dd.DataFrame:
    # Extract the prices column from the trades df
    prices_df = trades_df[price_col_name]

    # Resample trades into OHLC format of specified frequency
    resampled_prices_df = prices_df.resample(f'1S')  # <-- This line produces a row for each second
    ohlc = resampled_prices_df.ohlc()

    # Forward fill prices
    filled_ohlc = ohlc.ffill()

    return filled_ohlc

What is the fastest/most efficient way to achieve a resampled dataframe, excluding specified intraday time range and date range?

There are a couple of answers on stack that are simply too slow. Looping over values via .apply() simply takes too much time. I'm currently using Dask, for processing - but at this point any efficient solution with any library is appreciated.

Context:
Trying to compute OHLCV data from Trades data, excluding the times during a day where the market is closed and the days when the market are closed (holidays)

Considerations:

  • The df index is of type datetime64[ns]
  • The raw file has over 10B rows, data doesn't fit into memory.
  • If possible, I want to avoid non-vectorized operations because of cost concerns.
0

There are 0 best solutions below