I have the following sample dataframe:
| ID | date | data | avg |
|---|---|---|---|
| 1 | 18/3/2024 | 45.72991477 | 9.845398369 |
| 1 | 19/3/2024 | 16.66879054 | 12.22665416 |
| 1 | 20/3/2024 | 0.956178897 | 12.29306183 |
| 1 | 21/3/2024 | 0.008082319 | 12.27617302 |
| 1 | 22/3/2024 | 18.4076123 | 14.88212601 |
| 1 | 23/3/2024 | 8.3212202 | 12.87506446 |
| 1 | 24/3/2024 | 2.950587398 | 13.29176949 |
| 1 | 25/3/2024 | 79.01593876 | |
| 2 | 18/3/2024 | 5.234445299 | 1.555186844 |
| 2 | 19/3/2024 | 13.36603398 | 3.441331288 |
| 2 | 20/3/2024 | 2.76175807 | 3.835868155 |
| 2 | 21/3/2024 | 0.007854825 | 3.710561654 |
| 2 | 22/3/2024 | 32.75987619 | 8.361661724 |
| 2 | 23/3/2024 | 0.363040191 | 7.823976139 |
| 2 | 24/3/2024 | 3.461841086 | 8.279264234 |
| 2 | 25/3/2024 | 6.28783374 | |
The avg column is essentially a 7-day moving average. I currently populate it with the following code:

```python
index["avg"] = index.groupby("ID")["data"].transform(lambda x: x.rolling(7).mean())
```
I understand that every time this code runs, it recomputes the whole avg column, while my data grows incrementally on a daily basis. Is there a way to compute only the latest date's avg for each ID, using that ID's last 7 days of data (including the newly added row)? Note that the rolling average is per ID. Would this approach be more efficient than my original approach of recomputing the whole avg column on every run?
I think calculating the rolling mean for only the last 7 days will definitely be faster than recomputing it for the entire dataframe. Here is the code I tried with some dummy data.
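A minimal sketch of that incremental idea (the dummy data and names like `new_rows` are hypothetical; it assumes exactly one row per ID per day, so a 7-row tail equals a 7-day window): do the full rolling computation once, then when new rows arrive, fill avg only for the latest date by averaging each ID's last 7 rows.

```python
import pandas as pd

# Dummy data in the question's shape: two IDs, one row per day
dates = pd.date_range("2024-03-18", periods=8)
df = pd.DataFrame({
    "ID": [1] * 8 + [2] * 8,
    "date": list(dates) * 2,
    "data": [float(v) for v in range(16)],
})

# One-off full recompute (the original approach)
df["avg"] = df.groupby("ID")["data"].transform(lambda x: x.rolling(7).mean())

# Later, one new day arrives for each ID; append it with avg unset
new_rows = pd.DataFrame({
    "ID": [1, 2],
    "date": [pd.Timestamp("2024-03-26")] * 2,
    "data": [20.0, 30.0],
})
df = pd.concat([df, new_rows], ignore_index=True)

# Incremental update: average each ID's last 7 rows (which include
# the new row) and write the result only into the latest-date rows
mask = df["date"].eq(df["date"].max())
last7_mean = df.groupby("ID")["data"].apply(lambda s: s.tail(7).mean())
df.loc[mask, "avg"] = df.loc[mask, "ID"].map(last7_mean)
```

This touches only the two newly appended rows instead of recomputing every group, which is where the saving comes from on a large frame; note that masking by the latest date (rather than by `avg.isna()`) avoids accidentally overwriting the NaNs that `rolling(7)` legitimately leaves in each ID's first six rows.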