Get combined intervals for overlapping sessions

70 Views Asked by At

I want to calculate session intervals for multiple, possibly overlapping sessions per agent. I have a log table like this in a PostgreSQL database:

id  dt                  session_id  agent_id state
7   2024-01-25 22:26:57 4           3148    0
7   2024-01-25 22:24:57 4           3148    1
6   2024-01-25 22:23:57 2           3148    0
5   2024-01-25 15:53:30 1           3148    0
4   2024-01-25 15:53:30 3           3148    0
3   2024-01-25 13:53:02 3           3148    1
2   2024-01-25 12:43:10 2           3148    1
1   2024-01-25 12:30:02 1           3148    1

state = 1 indicates the start of a session, state = 0 its end.
For single sessions I use the lag() function:

select agent_id, status, dt, lag(dt) over (partition by (agent_id) order by dt) as pre_status_dt  from 
(
    select *, 
      lag(status) over (partition by (agent_id) order by dt) as pre_status 
    from (
        select cs.dt, cs.user_id as agent_id, cs.status
        from "channel_subscribe" cs 
        where cs.channel = 14
    ) as t1
    where t1.dt >= '2024-01-01'
) as t2
where t2.status != t2.pre_status
order by dt asc

But that does not work for overlapping sessions.

I would like to get this result:

agent_id  start_dt             end_dt
3148      2024-01-25 12:30:02  2024-01-25 22:23:57
3148      2024-01-25 22:24:57  2024-01-25 22:26:57
1

There are 1 best solutions below

0
Erwin Brandstetter On BEST ANSWER

Your sample shows nested sessions: Additional sessions start and end within the first open session of the same agent. Alternatively, sessions might be chained: next session starts within the first session, but ends later.

The following solutions cover both cases: A combined session starts with the first open session, and ends once all open sessions have been closed.

Each assumes consistent data: Every session that gets closed has been opened before (or at the same time, but with am earlier id). Unfinished sessions are reported as such (with null for end_dt). If your data is not as reliable, you need to define what can go wrong, and handle cases accordingly.

Pure SQL

SELECT agent_id
     , min(dt) AS start_dt
     , max(dt) FILTER (WHERE sum_state = 0) AS end_dt  -- !
FROM  (
   SELECT *
        , count(*) FILTER (WHERE sum_state = 0)
          OVER (PARTITION BY agent_id ORDER BY dt DESC, id DESC) AS island
   FROM  (
      SELECT *
           , sum(CASE state WHEN 0 THEN -1 ELSE 1 END)
             OVER (PARTITION BY agent_id ORDER BY dt, id) sum_state
      FROM   channel_subscribe
      ) sub1
   ) sub2
GROUP  BY agent_id, island
ORDER  BY agent_id, island DESC;

The query in sub1 adds a running count per agent as sum_state. A combined session ends when sum_state goes back to 0.

Subquery sub2 forms groups (island) by counting "0"-events per agent. Note the descending sort (DESC) to include each "0"-event in its group.

The outer query reports data as requested for every combined session. The added FILTER (WHERE sum_state = 0) makes sure that unfinished sessions are reported as such.

Procedural solution

Once we need multiple subquery levels (with changing sort order), it's probably faster to use a (combined) procedural solution like this PL/pgSQL function:

CREATE OR REPLACE FUNCTION f_combined_sessions()
  RETURNS TABLE (agent_id int, start_dt timestamp, end_dt timestamp)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r            record;
   _new_session bool;
   _agent_id    int;
BEGIN
   FOR r IN
      SELECT c.agent_id, c.dt
           , sum(CASE c.state WHEN 0 THEN -1 ELSE 1 END)
             OVER (PARTITION BY c.agent_id ORDER BY c.dt, c.id) AS sum_state
      FROM   channel_subscribe c
      ORDER  BY c.agent_id,  c.dt, c.id
   LOOP
      IF agent_id = r.agent_id THEN  -- same agent
         IF _new_session = true THEN
            start_dt     := r.dt;
            _new_session := false;
         END IF;
      ELSE                           -- new agent
         -- return open session?
         IF _new_session = false THEN
            end_dt := null;
            RETURN NEXT;
         END IF;
         agent_id     := r.agent_id;
         start_dt     := r.dt;
         _new_session := false;
      END IF;
      
      IF r.sum_state = 0 THEN
         end_dt := r.dt;
         RETURN NEXT;
         _new_session := true;
      END IF;
   END LOOP;

   -- return last open session?
   IF _new_session = false THEN
      end_dt := null;
      RETURN NEXT;
   END IF;
END
$func$;

Call:

SELECT * FROM f_combined_sessions();

fiddle


Related: