I want to groupby "ts_code" and calculate percentage between one column max and min value from another column after max based on last N rows for each group. Specifically,
df
ts_code high low
0 A 20 10
1 A 30 5
2 A 40 20
3 A 50 10
4 A 20 30
5 B 20 10
6 B 30 5
7 B 40 20
8 B 50 10
9 B 20 30
Goal
Below is my expected result
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NA NA
1 A 30 5 NA NA
2 A 40 20 0.5 NA
3 A 50 10 0.8 0.8
4 A 20 30 0.8 0.8
5 B 50 10 NA NA
6 B 30 5 NA NA
7 B 40 20 0.9 NA
8 B 10 10 0.75 0.9
9 B 20 30 0.75 0.75
ln_high_low_pct_chg(such as l3_high_low_pct_chg)= 1-(the min value of the low column after the peak)/(the max value of high column),on last N rows for each group and each row.
Try and problem
df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']
But it fails such that for second row, the l3_lowest would be 5 not 20. I don't know how to calculate percentage after peak.
For last 4 rows, at index=8, low=10,high=50,low=5, l4_high_low_pct_chg=0.9
, at index=9, high=40, low=10, l4_high_low_pct_chg=0.75
- If the rolling window is 52, for hy_code
880912group and index 1252,l52_high_low_pct_chgwould be 0.281131 and880301group and index 1251,l52_high_low_pct_chgwould be 0.321471.
Grouping by 'ts_code' is just a trivial groupby() function. DataFrame.rolling() function is for single columns, so it's a tricky to apply it if you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually groups the dataframe into n length sub-dataframes, then applies the function to calculate the value. idxmax() finds the index value of the peak of the low column, then we find the min() of the values that follow. The rest is pretty straightforward.
If you prefer to use the rolling function, this method gives the same output:
Finally, if you want to do a true rolling window calculation that avoids any index lookup, you can use the numpy_ext (https://pypi.org/project/numpy-ext/)
output:
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
Finally, we run the three tests under a timing function:
Which gives us this output:
So, the fastest method is to use the numpy_ext, which makes sense because that's optimized for vectorized calculations. The second fastest method is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is using Pandas rolling function.