I have some event data that looks like this:
| time | id | status | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2 | started | 8 |
| 2020-08-26T21:29:01.000 | 3 | started | 4 |
| 2020-08-26T21:29:02.000 | 2 | finished | 8 |
| 2020-08-26T21:29:03.000 | 4 | started | 12 |
| 2020-08-26T21:29:04.000 | 5 | started | 2 |
| 2020-08-26T21:29:05.000 | 6 | started | 24 |
| 2020-08-26T21:29:06.000 | 4 | finished | 12 |
| 2020-08-26T21:29:06.000 | 3 | finished | 4 |
| 2020-08-26T21:29:07.000 | 1 | finished | 1 |
| 2020-08-26T21:29:10.000 | 7 | started | 4 |
Note that the event data was started recording after things had started and there are events that haven't yet finished.
I'm trying to then get a running count of the rows and running sum of the values per second.
As soon as I think running count I think of window queries but I'm struggling to work out how I get from this data to the output I expect.
I'd ideally then expect to get the following results:
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1 | 1 |
| 2020-08-26T21:29:01.000 | 3 | 13 |
| 2020-08-26T21:29:02.000 | 2 | 5 |
| 2020-08-26T21:29:03.000 | 3 | 17 |
| 2020-08-26T21:29:04.000 | 4 | 19 |
| 2020-08-26T21:29:05.000 | 5 | 43 |
| 2020-08-26T21:29:06.000 | 3 | 29 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
I'd also be happy enough with an answer that didn't take into account the 1
id record that was running before events began recording which would then have the following results:
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0 | 0 |
| 2020-08-26T21:29:01.000 | 2 | 12 |
| 2020-08-26T21:29:02.000 | 1 | 4 |
| 2020-08-26T21:29:03.000 | 2 | 16 |
| 2020-08-26T21:29:04.000 | 3 | 18 |
| 2020-08-26T21:29:05.000 | 4 | 42 |
| 2020-08-26T21:29:06.000 | 2 | 28 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
As Athena/Presto doesn't support full joins I was able to get the start and stop times for each id
with the following query (also on SQL Fiddle):
WITH started AS (
SELECT *
FROM foo
WHERE status = 'started'
), finished AS (
SELECT *
FROM foo
WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id
I think you want a cumulative conditional sum:
The
sum()
s need to be nested because one is needed for the window function and the other is for the aggregation.