Calculate Relative Volume Ratio indicator in a pandas DataFrame and add the indicator value as a new column


I know there have been a few posts on this, but my case is a little bit different and I wanted to get some help.

I have a pandas dataframe symbol_df with 1 min bars in the below format for each stock symbol:

id  Symbol_id                Date     Open     High      Low    Close  Volume
1          1 2023-12-13 09:15:00  4730.95  4744.00  4713.95  4696.40    2300
2          1 2023-12-13 09:16:00  4713.20  4723.70  4717.85  4702.55    1522
3          1 2023-12-13 09:17:00  4716.40  4718.55  4701.00  4701.00     909
4          1 2023-12-13 09:18:00  4700.15  4702.80  4696.70  4696.00     715
5          1 2023-12-13 09:19:00  4696.70  4709.90  4702.00  4696.10     895
...        ...                 ...      ...      ...      ...      ...     ...
108001     1 2024-03-27 13:44:00  6289.95  6291.95  6289.00  6287.55     989
108002     1 2024-03-27 13:45:00  6288.95  6290.85  6289.00  6287.75     286
108003     1 2024-03-27 13:46:00  6291.25  6293.60  6292.05  6289.10    1433
108004     1 2024-03-27 13:47:00  6295.00  6299.00  6293.20  6293.15    2702
108005     1 2024-03-27 13:48:00  6292.05  6296.55  6291.95  6291.95     983

I would like to calculate the "Relative Volume Ratio" indicator and add this calculated value to the symbol_df as a new column on a rolling basis.

"Relative volume ratio" indicator calculated as below:

So far today's Volume is compared with the mean volume of the last 10 days of the same period. To get the ratio value, we simply divide "today so far volume" by "mean volume of the last 10 days of the same period".

For example..the current bar time is now 13:48.

cumulativeVolumeOfToday = Volume of 1 minuite bars between 00:00 -13:48 today added up

avergeVolumeOfPreviousDaysOfSamePeriod = Average accumulation of volume from the same period(00:00 - 13:48) over the last 10 days.

relativeVolumeRatio = CumulativeVolumeOfToday/AvergeVolumeOfPrevious10DaysOfSamePeriod

Add this value as a new column to the dataframe.
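
For illustration, here is a minimal numeric sketch of the formula for a single bar; all numbers are hypothetical and only serve to show the arithmetic:

# Hypothetical numbers, purely to illustrate the formula
cumulative_volume_today = 1_200_000  # volume summed over 00:00 - 13:48 today
previous_day_sums = [900_000, 1_100_000, 1_000_000]  # same-period sums of previous days (10 in practice)
average_previous = sum(previous_day_sums) / len(previous_day_sums)  # 1_000_000.0
relative_volume_ratio = cumulative_volume_today / average_previous  # 1.2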

Sample data download for the test case:

import yfinance as yf #pip install yfinance
from datetime import datetime
import pandas as pd

symbol_df = yf.download(tickers="AAPL", period="7d", interval="1m")["Volume"] 
symbol_df = symbol_df.reset_index()
#symbol_df['Datetime'] = symbol_df['Datetime'].dt.strftime('%Y-%m-%d %H:%M')
symbol_df = symbol_df.rename(columns={'Datetime': 'Date'})
# yfinance only allows downloading 7 days of 1-minute sample data, so use a 5-day mean in the test calculations

How can I do this in Pandas?

2 Answers

Vitalizzare (Best Answer)

TL;DR

from yfinance import download

# Prepare data similar to the original
symbol_df = (
    download(tickers="AAPL", period="7d", interval="1m")
    .rename_axis(index='Date')
    .reset_index()
)

# Calculate Relative Volume Ratio
volume = symbol_df.set_index('Date')['Volume']
dts = volume.index
cum_volume = volume.groupby(dts.date, sort=False).cumsum()
prev_mean = lambda days: (
    cum_volume
    .groupby(dts.time, sort=False)
    .rolling(days, closed='left')
    .mean()
    .reset_index(0, drop=True)    # drop the level with dts.time
)
rvr = cum_volume / prev_mean(5)

# Assign the output to the initial data
symbol_df = symbol_df.join(rvr.rename('Relative volume ratio'), on='Date')

Explanation

Based on the provided description, you need to perform several transformations on the aggregated data. First, cumulatively sum the volume within each day. Then run a ten-day window over the data grouped by time of day to calculate the average. Finally, divide the former by the latter.

Let's say you have the following test data, where "Date" is a column of type datetime:

from yfinance import download

symbol_df = (
    download(tickers="AAPL", period="7d", interval="1m")
    .rename_axis(index='Date')
    .reset_index()
)

To calculate the Relative Volume Ratio values, we will use "Volume" as a separate sequence with date-time stamps "Date" as its index:

volume = symbol_df.set_index('Date')['Volume']
dts = volume.index    # date-time stamps for convenient grouping

Let's create a sequence of cumulative volumes for each day. For this, we group volume by its date (the year, month and day values, with no time) and apply cumsum to each group (sort=False is passed in the hope of speeding up the calculation):

cum_volume = volume.groupby(dts.date, sort=False).cumsum()
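
As a quick illustration with hypothetical data, the per-day cumulative sum keeps the original timestamps as its index and restarts at each new date:

import pandas as pd

# Hypothetical toy series: two days, two 1-minute bars each
idx = pd.to_datetime([
    "2024-01-01 09:15", "2024-01-01 09:16",
    "2024-01-02 09:15", "2024-01-02 09:16",
])
toy = pd.Series([100, 50, 200, 25], index=idx, name="Volume")
print(toy.groupby(toy.index.date, sort=False).cumsum())
# 2024-01-01 09:15:00    100
# 2024-01-01 09:16:00    150
# 2024-01-02 09:15:00    200   <- the running sum restarts on the new date
# 2024-01-02 09:16:00    225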

To calculate the mean of the cumulative volumes at the same time of day over the given number of previous days, we group cum_volume by its time (hours and minutes, with no year, month or day values) and apply rolling calculations to each group to obtain averages over the windows. Note that here we need the source data to be sorted by date-time stamps, since only business days are taken into account and we can't use a non-fixed frequency such as "10B" as the window value. To calculate means over exactly the previous days, excluding the current one, we pass closed='left' (see the DataFrameGroupBy.rolling docs for details):

prev_mean = lambda days: (
    cum_volume
    .groupby(dts.time, sort=False)
    .rolling(days, closed='left')
    .mean()
    .reset_index(0, drop=True)
)
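
To see what closed='left' does with an integer window (supported since pandas 1.2), here is a small sketch with hypothetical cumulative volumes of 100, 200 and 80 at 09:15 on three consecutive days:

import pandas as pd

# Cumulative volume at 09:15 on three hypothetical consecutive days
s = pd.Series([100, 200, 80])
print(s.rolling(2, closed='left').mean())
# 0      NaN    <- no previous days yet
# 1      NaN    <- only one previous day, but min_periods defaults to the window size (2)
# 2    150.0    <- mean of the two previous days (100 and 200); the current day (80) is excluded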

Now the final touch, with a window of 5 days (the 7-day sample data only allows a 5-day mean):

rvr = cum_volume / prev_mean(5)

Comparison

Compared to Andrej Kesely's solution, this one wins on speed (on an Intel Core i3-2100, for example, processing the data offered there takes over 1 minute versus 300-400 ms with the code above). The results are identical for timestamps after the first 10 days. But at the beginning, when there are fewer than 10 previous days, the rolling mean is calculated as if a full 10-item window were required (the missing values become NaN), whereas Kesely's solution averages over only the cumulative volumes that are actually available.
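
If that early-data behaviour matters and you would rather average over however many previous days are actually available, passing min_periods=1 to the rolling call should get close to Kesely's behaviour; a hedged sketch, reusing cum_volume and dts from above:

# Variation that averages over the available previous days instead of
# returning NaN until a full window of previous days exists
prev_mean_available = lambda days: (
    cum_volume
    .groupby(dts.time, sort=False)
    .rolling(days, min_periods=1, closed='left')
    .mean()
    .reset_index(0, drop=True)
)
rvr_from_start = cum_volume / prev_mean_available(5)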

Andrej Kesely

I hope I've understood your question right:

  1. compute the cumulative sum of "Volume" for today (that is, from 00:00 up to the current datetime)
  2. compute the mean of the sums of "Volume" for the last 10 days, each taken from 00:00 up to the same time of day
  3. divide the cumulative sum of "Volume" for today by the mean of the sums from the last 10 days

import numpy as np
import pandas as pd
from tqdm import tqdm


df = df.set_index("Date")


# Cache of already-computed cumulative sums, keyed by timestamp
cache = {}


def get_volume_sum(df, dt):
    # Sum of "Volume" from the start of dt's day (00:00) up to and including dt
    rv = cache.get(dt, None)
    if rv is None:
        rv = df.loc[dt.date() : dt, "Volume"].sum()
        cache[dt] = rv
    return rv


out, N = [], 10
for d in tqdm(df.index):
    cumulativeVolumeOfToday = get_volume_sum(df, d)
    # Same-period cumulative volumes for each of the previous N calendar days
    vals = []
    for day in range(N, 0, -1):
        dt = d - pd.Timedelta(days=day)
        vals.append(get_volume_sum(df, dt))
    if not vals:
        out.append(np.nan)
    else:
        averageVolumeOfPrevious10DaysOfSamePeriod = np.mean(np.array(vals))
        if averageVolumeOfPrevious10DaysOfSamePeriod != 0:
            out.append(
                cumulativeVolumeOfToday / averageVolumeOfPrevious10DaysOfSamePeriod
            )
        else:
            out.append(np.nan)

df["relativeVolumeRatio"] = out
print(df.tail(10))

Prints (the computation took ~21 seconds for 150k records on my machine, AMD 5700X):

100%|███████████████████████████████████| 151474/151474 [00:21<00:00, 7136.78it/s]

                     Symbol_id  Open  Close  Volume  relativeVolumeRatio
Date                                                                    
2024-03-27 13:39:00          1  3520   4340     731             0.993289
2024-03-27 13:40:00          1  4868   4748     131             0.992341
2024-03-27 13:41:00          1  5833   5342    1828             0.992327
2024-03-27 13:42:00          1  4006   3419    1755             0.992667
2024-03-27 13:43:00          1  3465   3827     690             0.992588
2024-03-27 13:44:00          1  5624   4924    2315             0.993013
2024-03-27 13:45:00          1  5012   5321    2961             0.993927
2024-03-27 13:46:00          1  3292   4193     112             0.992635
2024-03-27 13:47:00          1  4823   5520    2098             0.993061
2024-03-27 13:48:00          1  4937   4560     679             0.992526

The dataframe was created with:

np.random.seed(42)


def generate_df(from_, to_):
    dr = pd.date_range(from_, to_, freq="1min")
    return pd.DataFrame(
        {
            "Symbol_id": [1] * len(dr),
            "Date": dr,
            "Open": np.random.randint(3000, 6000, size=len(dr)),
            "Close": np.random.randint(3000, 6000, size=len(dr)),
            "Volume": np.random.randint(100, 3000, size=len(dr)),
        }
    )


df = generate_df("2023-12-13 09:15:00", "2024-03-27 13:48:00")