Given a DataFrame with a timestamp column (ts), I'd like to resample it by the hour (downsample). Values that were previously indexed by ts should now be divided into ratios based on the number of minutes left in an hour. [Note: divide the data in ratios for NaN columns while resampling.]
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
import pandas as pd

def create_dataframe():
    # Build the sample data; ts is parsed into datetime64 below
    df = pd.DataFrame([
        {'duration': 12, 'event': 'a', 'ts': '2020-09-09 21:01:00'},
        {'duration': 22, 'event': 'a', 'ts': '2020-09-10 00:10:00'},
        {'duration': 130, 'event': 'a', 'ts': '2020-09-10 01:31:00'},
        {'duration': 60, 'event': 'b', 'ts': '2020-09-10 01:50:00'},
        {'duration': 50, 'event': 'b', 'ts': '2020-09-10 01:51:00'},
        {'duration': 26, 'event': 'b', 'ts': '2020-09-10 01:59:00'},
        {'duration': 72, 'event': 'c', 'ts': '2020-09-10 02:01:00'},
        {'duration': 51, 'event': 'b', 'ts': '2020-09-10 02:51:00'},
        {'duration': 63, 'event': 'b', 'ts': '2020-09-10 03:01:00'},
        {'duration': 79, 'event': 'c', 'ts': '2020-09-10 04:01:00'},
        {'duration': 179, 'event': 'c', 'ts': '2020-09-10 06:05:00'},
    ])
    df.ts = pd.to_datetime(df.ts)
    return df
I want to estimate the amount produced based on the ratio of time spent to amount produced. This can be compared to asking how many lines of code have been completed, or how many lines were actually written per hour.
For example: at "2020-09-10 00:10:00" we have 22. Then, during the period from 21:01 to 00:10, the amount produced is split as follows:
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
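The arithmetic above can be sketched in plain Python. This is only an illustration of the proportional split, not a pandas solution; the helper name `split_by_hour` is hypothetical.

```python
from datetime import datetime, timedelta

def split_by_hour(start, end, value):
    """Split `value` over the clock hours touched by [start, end), weighted by minutes."""
    total_min = (end - start).total_seconds() / 60
    parts = []
    cur = start
    while cur < end:
        hour_start = cur.replace(minute=0, second=0, microsecond=0)
        # The current segment ends at the next full hour or at `end`, whichever is first
        seg_end = min(hour_start + timedelta(hours=1), end)
        minutes = (seg_end - cur).total_seconds() / 60
        parts.append((hour_start, minutes, round(value * minutes / total_min)))
        cur = seg_end
    return parts

parts = split_by_hour(datetime(2020, 9, 9, 21, 1), datetime(2020, 9, 10, 0, 10), 22)
for hour_start, minutes, share in parts:
    print(hour_start, minutes, share)
```

The four shares come out as 7, 7, 7 and 1, matching the ROUND formulas above. Note that rounding each share independently does not guarantee that the shares add up to the original value in general.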
The result should be something like this:
ts event duration
0 2020-09-09 20:00:00 a NaN
1 2020-09-09 21:00:00 a 7
2 2020-09-09 22:00:00 a 7
3 2020-09-09 23:00:00 a 7
4 2020-09-10 00:00:00 a 1
5 2020-09-10 01:00:00 b ..
6 2020-09-10 02:01:00 c ..
Problem with this approach:
It appears to me that we have a serious issue with this approach. If you look at rows[1] -> 2020-09-10 07:00:00, we have 4, which we need to divide between 3 hours. Considering the base duration value as 1 (base unit), we however get
def create_dataframe2():
    df = pd.DataFrame([{'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 07:00:00'},
                       {'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
Source
duration event c ts
0 4 c event3.5 2020-09-10 07:00:00
1 4 c event3.5 2020-09-10 10:00:00
Expected Output
ts_hourly mins duration
0 2020-09-10 07:00:00 60.0 2
1 2020-09-10 08:00:00 60.0 1
2 2020-09-10 09:00:00 60.0 1
3 2020-09-10 10:00:00 0.0 0
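One way to read the expected output is that whole units are handed out per hour and any leftover unit goes to the earliest hour. The sketch below uses a largest-remainder split under that assumption; the function name `split_whole_units` and the tie-breaking rule are hypothetical, not something stated above.

```python
import math

def split_whole_units(total, minutes):
    """Split an integer `total` over buckets in proportion to `minutes`,
    keeping every share a whole number (largest-remainder method)."""
    total_min = sum(minutes)
    if total_min == 0:
        return [0] * len(minutes)
    exact = [total * m / total_min for m in minutes]
    shares = [math.floor(x) for x in exact]
    leftover = total - sum(shares)
    # Hand leftover units to the largest fractional parts; ties go to the earliest bucket
    order = sorted(range(len(minutes)), key=lambda i: (-(exact[i] - shares[i]), i))
    for i in order[:leftover]:
        shares[i] += 1
    return shares

shares = split_whole_units(4, [60, 60, 60, 0])
print(shares)
```

With 4 units over three full hours this gives 2, 1, 1 and 0, and the shares always add up to the original total.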
The first step is to add a "previous ts" column to the source DataFrame:
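A minimal sketch of this step on the first three source rows, assuming the new column is called `tsPrev` (the name that the `parts` function refers to) and is filled with `shift()`:

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime(['2020-09-09 21:01:00', '2020-09-10 00:10:00',
                          '2020-09-10 01:31:00']),
    'event': ['a', 'a', 'a'],
    'duration': [12, 22, 130],
})

# Each row gets the timestamp of the row before it; the first row gets NaT
df['tsPrev'] = df['ts'].shift()
print(df)
```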
Then set the ts column as the index:
The third step is to create an auxiliary index, composed of the original index and "full hours":
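A sketch of these two steps on the first three source rows. The names `first_hour`, `full_hours` and `ind` are illustrative assumptions; the hourly range is built with `pd.date_range`, which is one of several ways to get the "full hours".

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime(['2020-09-09 21:01:00', '2020-09-10 00:10:00',
                          '2020-09-10 01:31:00']),
    'event': ['a', 'a', 'a'],
    'duration': [12, 22, 130],
})
df = df.set_index('ts')

# Full hours from the hour of the first timestamp up to the last timestamp
first_hour = df.index.min().replace(minute=0, second=0, microsecond=0)
full_hours = pd.date_range(first_hour, df.index.max(), freq=pd.offsets.Hour())

# Sorted union of the full hours and the original timestamps
ind = full_hours.union(df.index)
print(ind)
```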
Then create an auxiliary DataFrame, reindexed with the newly created index, and back-fill its event column:
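A sketch of the reindex and back-fill on three source rows, with the combined index built by hand. The name `df2` is taken from the text; everything else is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['a', 'a', 'b'],
    'duration': [22, 130, 60],
}, index=pd.to_datetime(['2020-09-10 00:10:00', '2020-09-10 01:31:00',
                         '2020-09-10 01:50:00']))

# Full hours plus the original timestamps
ind = pd.date_range('2020-09-10 00:00', '2020-09-10 01:00',
                    freq=pd.offsets.Hour()).union(df.index)

df2 = df.reindex(ind)                 # new "full hour" rows hold NaN everywhere
df2['event'] = df2['event'].bfill()   # each new row takes the event of the next source row
print(df2)
```

The duration column stays NaN in the inserted rows, which is what later marks the group boundaries.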
Define a function to be applied to each group of rows from df2:
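A sketch of what such a function could look like. Only the names `parts`, `lstRow` and the first `return` line come from the text; the rest is an assumed implementation that spreads the group's duration over its rows in proportion to elapsed time, rounding the running total so the parts add up to the original value.

```python
import numpy as np
import pandas as pd

def parts(grp):
    lstRow = grp.iloc[-1]                   # last row of the group, the only one with a duration
    if pd.isna(lstRow.tsPrev):              # first source row: nothing to spread it over
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # Interval boundaries: previous source ts, the full hours in between, this ts
    bounds = pd.DatetimeIndex([lstRow.tsPrev]).append(grp.index)
    elapsed = (bounds - bounds[0]).total_seconds().to_numpy()
    # Running total at each boundary, rounded, then differenced into per-interval parts
    cum = np.round(lstRow.duration * elapsed / elapsed[-1])
    return pd.Series(np.diff(cum).astype(int), index=bounds[:-1])

# Hand-built group for the source row "00:10:00 a 22"
grp = pd.DataFrame({
    'event': 'a',
    'duration': [np.nan, np.nan, np.nan, 22],
    'tsPrev': [pd.NaT, pd.NaT, pd.NaT, pd.Timestamp('2020-09-09 21:01:00')],
}, index=pd.to_datetime(['2020-09-09 22:00', '2020-09-09 23:00',
                         '2020-09-10 00:00', '2020-09-10 00:10']))
result = parts(grp)
print(result)
```

For this group the parts are 7, 7, 7 and 1, indexed by 21:01, 22:00, 23:00 and 00:00.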
Then generate the source data for the "produced" column in two steps:
Generate detailed data:
The source is df2, grouped in such a way that each group ends with a row holding a non-null value in the duration column. Each group is then processed with the parts function.
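An end-to-end sketch of the grouping on the first three source rows. The grouping key and the use of `pd.concat` over the groups are assumptions chosen for clarity (an equivalent `groupby(...).apply(parts)` is also possible); the body of `parts` is the assumed implementation described above.

```python
import numpy as np
import pandas as pd

def parts(grp):
    lstRow = grp.iloc[-1]
    if pd.isna(lstRow.tsPrev):
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    bounds = pd.DatetimeIndex([lstRow.tsPrev]).append(grp.index)
    elapsed = (bounds - bounds[0]).total_seconds().to_numpy()
    cum = np.round(lstRow.duration * elapsed / elapsed[-1])
    return pd.Series(np.diff(cum).astype(int), index=bounds[:-1])

df = pd.DataFrame({
    'ts': pd.to_datetime(['2020-09-09 21:01:00', '2020-09-10 00:10:00',
                          '2020-09-10 01:31:00']),
    'event': ['a', 'a', 'a'],
    'duration': [12, 22, 130],
})
df['tsPrev'] = df['ts'].shift()
df = df.set_index('ts')
first_hour = df.index.min().replace(minute=0, second=0, microsecond=0)
ind = pd.date_range(first_hour, df.index.max(), freq=pd.offsets.Hour()).union(df.index)
df2 = df.reindex(ind)

# A new group starts right after every row that has a duration
key = df2['duration'].notna().shift(fill_value=False).cumsum()
prodDet = pd.concat([parts(grp) for _, grp in df2.groupby(key)])
print(prodDet)
```

On this truncated input the detailed values are 12, 7, 7, 7, 1, 80 and 50, and they add up to 164, the sum of the three source durations.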
The result is:
Generate aggregated data, for the time being also as a Series:
This time prodDet is resampled (broken down by hours) and the result is the sum of values.
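A sketch of the hourly aggregation, using a hand-built `prodDet` that holds only the detailed values produced by the first three source rows (so the totals differ from those of the full data set):

```python
import pandas as pd

prodDet = pd.Series(
    [12, 7, 7, 7, 1, 80, 50],
    index=pd.to_datetime(['2020-09-09 21:00', '2020-09-09 21:01', '2020-09-09 22:00',
                          '2020-09-09 23:00', '2020-09-10 00:00', '2020-09-10 00:10',
                          '2020-09-10 01:00']))

# Sum the detailed parts that fall into the same clock hour
prod = prodDet.resample(pd.offsets.Hour()).sum().rename('produced')
print(prod)
```

Here the 21:00 bucket holds 12 + 7 = 19 and the 00:00 bucket holds 1 + 80 = 81.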
The result is:
Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row is from this hour (your data start at 21:01:00).
The row
21:00:00 12
comes from the first source row (you forgot about it when writing the expected result).
The rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from "partitioning" of the row
00:10:00 a 22
just as in part of your expected result.
The rows with 80 and 50 come from the row containing 130, divided between the rows for 00:10:00 and 01:00:00.
And so on.
Now we start to assemble the final result.
Join prod (converted to a DataFrame) with the event column:
Add a tsMin column, the minimal ts in each hour (as you asked in one of the comments):
Change the index into a regular column and set its name to ts (like in the source DataFrame):
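The three assembly steps can be sketched as follows, on hand-built inputs covering only the first three source rows. The stand-in `event` Series plays the role of the back-filled df2.event, and the groupby used for `tsMin` is an assumed way of getting the minimal ts per hour.

```python
import pandas as pd

HOUR = pd.offsets.Hour()
src_ts = pd.to_datetime(['2020-09-09 21:01', '2020-09-10 00:10', '2020-09-10 01:31'])
hours = pd.date_range('2020-09-09 21:00', '2020-09-10 01:00', freq=HOUR)

prod = pd.Series([19, 7, 7, 81, 50], index=hours, name='produced')
event = pd.Series('a', index=hours.union(src_ts), name='event')  # stand-in for df2.event

# 1. Join prod (as a DataFrame) with the event column; only the hourly rows survive
res = prod.to_frame().join(event)

# 2. Minimal source ts within each hour; hours without a source row get NaT
tsMin = src_ts.to_series().groupby(src_ts.floor(HOUR)).min().rename('tsMin')
res = res.join(tsMin)

# 3. Turn the index into a regular column named ts
res = res.rename_axis('ts').reset_index()
print(res)
```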
The final result is:
E.g. the value of 81 for 00:00:00 is a sum of 1 and 80 (the first part resulting from row with 130), see prodDet above.
Some values in tsMin column are empty, for hours in which there is no source row.
If you want to drop the result from the first row (with duration == 12) entirely, change
return pd.Series([lstRow.duration]...
to
return pd.Series([0]...
(the 4th line of the parts function).
To sum up, my solution is more pandasonic and significantly shorter than yours (17 lines in my solution vs. about 70 in yours, excluding comments).