I am trying to resample a dataframe and compute aggregated values. During the resampling process, I want to be able to exclude:
- specific time ranges during each day, format:
%H:%M:%S - specific days during the period, format
%YYYY-MM-DD
def _ohlc(
trades_df,
price_col_name,
) -> dd.DataFrame:
# Extract the prices column from the trades df
prices_df = trades_df[price_col_name]
# Resample trades into OHLC format of specified frequency
resampled_prices_df = prices_df.resample(f'1S') # <-- This line produces a row for each second
ohlc = resampled_prices_df.ohlc()
# Forward fill prices
filled_ohlc = ohlc.ffill()
return filled_ohlc
What is the fastest/most efficient way to achieve a resampled dataframe, excluding specified intraday time range and date range?
There are a couple of answers on stack that are simply too slow.
Looping over values via .apply() simply takes too much time. I'm currently using Dask, for processing - but at this point any efficient solution with any library is appreciated.
Context:
Trying to compute OHLCV data from Trades data, excluding the times during a day where the market is closed and the days when the market are closed (holidays)
Considerations:
- The df index is of type
datetime64[ns] - The raw file has over 10B rows, data doesn't fit into memory.
- If possible, I want to avoid non-vectorized operations because of cost concerns.