90 days) by group" column. I created this table in excel as a prototy" /> 90 days) by group" column. I created this table in excel as a prototy" /> 90 days) by group" column. I created this table in excel as a prototy"/>

How to Calculate a Conditional Rolling Count by Group, comparing to current row, in Using SQL

69 Views Asked by At

I'm trying to create query (duckdb) that generates a "Rolling Count Very Late (> 90 days) by group" column.

I created this table in excel as a prototype of what I am after:

This formula counts the number of past/prior rows where:

  • The 'days' value (column D) is greater than 90,
  • The 'group' value (column F) matches the current row's group,
  • The 'Date Closed' value is less than the current row's 'Invoice Date'.
A B C D E F G H
1 Invoice Date Due Date Date Closed days amount group Rolling Count Very Late (> 90 days) by group Rolling Count Very Late (> 90 days) by group formula
2 22-Oct-22 21-Nov-22 1-Mar-23 100 111 B 0 =COUNTIFS(E$3:E3,">90",G$3:G3,G3,D$3:D3,"<"&B3)
3 24-Oct-22 23-Nov-22 3-Mar-23 100 150 B 0 =COUNTIFS(E$3:E4,">90",G$3:G4,G4,D$3:D4,"<"&B4)
4 31-Oct-22 30-Nov-22 5-Mar-23 95 300 A 0 =COUNTIFS(E$3:E5,">90",G$3:G5,G5,D$3:D5,"<"&B5)
5 10-Nov-22 10-Dec-22 2-Feb-23 54 180 A 0 =COUNTIFS(E$3:E6,">90",G$3:G6,G6,D$3:D6,"<"&B6)
6 21-Nov-22 21-Dec-22 4-Feb-23 45 220 B 0 =COUNTIFS(E$3:E7,">90",G$3:G7,G7,D$3:D7,"<"&B7)
7 4-Dec-22 3-Jan-23 4-Jan-23 1 210 B 0 =COUNTIFS(E$3:E8,">90",G$3:G8,G8,D$3:D8,"<"&B8)
8 19-Dec-22 18-Jan-23 20-Jan-23 2 100 A 0 =COUNTIFS(E$3:E9,">90",G$3:G9,G9,D$3:D9,"<"&B9)
9 1-Jan-23 31-Jan-23 20-Feb-23 20 200 B 0 =COUNTIFS(E$3:E10,">90",G$3:G10,G10,D$3:D10,"<"&B10)
10 22-Jan-23 21-Feb-23 1-Jun-23 100 280 B 0 =COUNTIFS(E$3:E11,">90",G$3:G11,G11,D$3:D11,"<"&B11)
11 28-Feb-23 30-Mar-23 2-Apr-23 3 250 A 0 =COUNTIFS(E$3:E12,">90",G$3:G12,G12,D$3:D12,"<"&B12)
12 14-May-23 13-Jun-23 1-Sep-23 80 21 A 1 =COUNTIFS(E$3:E13,">90",G$3:G13,G13,D$3:D13,"<"&B13)
13 18-Jun-23 18-Jul-23 1-Oct-23 75 456 A 1 =COUNTIFS(E$3:E14,">90",G$3:G14,G14,D$3:D14,"<"&B14)
14 2-Jul-23 1-Aug-23 1-Aug-23 0 320 B 3 =COUNTIFS(E$3:E15,">90",G$3:G15,G15,D$3:D15,"<"&B15)

The current row piece is the obstacle, not sure how to fit that into the query

I have tried a window function but I cannot get the query to filter out results where "Date Closed" < current row "Invoice Date" because SQL is very set oriented and doesn't really have a nice easy way of doing this. Maybe there's a self join solution or something but that's beyond my ability unfortunately. I would appreciate any help. I have an extremely slow solution python that is basically too slow to even consider using with a large amount of data so hoping an SQL solution is available.

Here's my completely not working SQL

SELECT 
  "Invoice Date",
  "Due Date",
  "Date Closed",
  "days",
  "amount",
  "group",
  SUM(CASE WHEN days > 90 AND "Date Closed" < "Invoice Date" THEN 1 ELSE 0 END) 
    OVER (PARTITION BY "group" ORDER BY "Invoice Date" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
    AS "Rolling Count Very Late (> 90 days) by group"
FROM 
  main.example_tbl 
ORDER BY 
   "Invoice Date";

1

There are 1 best solutions below

0
jqurious On BEST ANSWER

As mentioned, you can use a self join.

Sample data using Python API

import duckdb
import io

db = duckdb.read_csv(io.BytesIO(b"""
Invoice Date|Due Date|Date Closed|days|amount|group
22-Oct-22|21-Nov-22|1-Mar-23|100|111|B
24-Oct-22|23-Nov-22|3-Mar-23|100|150|B
31-Oct-22|30-Nov-22|5-Mar-23|95|300|A
10-Nov-22|10-Dec-22|2-Feb-23|54|180|A
21-Nov-22|21-Dec-22|4-Feb-23|45|220|B
4-Dec-22|3-Jan-23|4-Jan-23|1|210|B
19-Dec-22|18-Jan-23|20-Jan-23|2|100|A
1-Jan-23|31-Jan-23|20-Feb-23|20|200|B
22-Jan-23|21-Feb-23|1-Jun-23|100|280|B
28-Feb-23|30-Mar-23|2-Apr-23|3|250|A
14-May-23|13-Jun-23|1-Sep-23|80|21|A
18-Jun-23|18-Jul-23|1-Oct-23|75|456|A
2-Jul-23|1-Aug-23|1-Aug-23|0|320|B
""".strip())
)

db = db.select("""
row_number() over () row_number,
* replace( 
   strptime("Invoice Date", '%d-%b-%y') as "Invoice Date",
   strptime("Due Date",     '%d-%b-%y') as "Due Date",
   strptime("Date Closed",  '%d-%b-%y') as "Date Closed"
)
""")
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬────────┬─────────┐
│ row_number │    Invoice Date     │      Due Date       │     Date Closed     │ days  │ amount │  group  │
│   int64    │      timestamp      │      timestamp      │      timestamp      │ int64 │ int64  │ varchar │
├────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────┼────────┼─────────┤
│          1 │ 2022-10-22 00:00:00 │ 2022-11-21 00:00:00 │ 2023-03-01 00:00:00 │   100 │    111 │ B       │
│          2 │ 2022-10-24 00:00:00 │ 2022-11-23 00:00:00 │ 2023-03-03 00:00:00 │   100 │    150 │ B       │
│          3 │ 2022-10-31 00:00:00 │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │    300 │ A       │
│          4 │ 2022-11-10 00:00:00 │ 2022-12-10 00:00:00 │ 2023-02-02 00:00:00 │    54 │    180 │ A       │
│          5 │ 2022-11-21 00:00:00 │ 2022-12-21 00:00:00 │ 2023-02-04 00:00:00 │    45 │    220 │ B       │
│          6 │ 2022-12-04 00:00:00 │ 2023-01-03 00:00:00 │ 2023-01-04 00:00:00 │     1 │    210 │ B       │
│          7 │ 2022-12-19 00:00:00 │ 2023-01-18 00:00:00 │ 2023-01-20 00:00:00 │     2 │    100 │ A       │
│          8 │ 2023-01-01 00:00:00 │ 2023-01-31 00:00:00 │ 2023-02-20 00:00:00 │    20 │    200 │ B       │
│          9 │ 2023-01-22 00:00:00 │ 2023-02-21 00:00:00 │ 2023-06-01 00:00:00 │   100 │    280 │ B       │
│         10 │ 2023-02-28 00:00:00 │ 2023-03-30 00:00:00 │ 2023-04-02 00:00:00 │     3 │    250 │ A       │
│         11 │ 2023-05-14 00:00:00 │ 2023-06-13 00:00:00 │ 2023-09-01 00:00:00 │    80 │     21 │ A       │
│         12 │ 2023-06-18 00:00:00 │ 2023-07-18 00:00:00 │ 2023-10-01 00:00:00 │    75 │    456 │ A       │
│         13 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │    320 │ B       │
├────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴────────┴─────────┤
│ 13 rows                                                                                       7 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Self Join

We used row_number() over () row_number to add a row number column.

This is used as part of the join condition to match "previous rows".

duckdb.sql("""
from 
   db self left join db other 
on
   self.group = other.group
   and
   self.row_number > other.row_number 
   and
   self."Invoice Date" > other."Date Closed"
   and
   other.days > 90
order by 
   self.row_number
""")
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬───┬─────────────────────┬─────────────────────┬───────┬────────┬─────────┐
│ row_number │    Invoice Date     │      Due Date       │     Date Closed     │ days  │ … │      Due Date       │     Date Closed     │ days  │ amount │  group  │
│   int64    │      timestamp      │      timestamp      │      timestamp      │ int64 │   │      timestamp      │      timestamp      │ int64 │ int64  │ varchar │
├────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────┼───┼─────────────────────┼─────────────────────┼───────┼────────┼─────────┤
│          1 │ 2022-10-22 00:00:00 │ 2022-11-21 00:00:00 │ 2023-03-01 00:00:00 │   100 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          2 │ 2022-10-24 00:00:00 │ 2022-11-23 00:00:00 │ 2023-03-03 00:00:00 │   100 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          3 │ 2022-10-31 00:00:00 │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          4 │ 2022-11-10 00:00:00 │ 2022-12-10 00:00:00 │ 2023-02-02 00:00:00 │    54 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          5 │ 2022-11-21 00:00:00 │ 2022-12-21 00:00:00 │ 2023-02-04 00:00:00 │    45 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          6 │ 2022-12-04 00:00:00 │ 2023-01-03 00:00:00 │ 2023-01-04 00:00:00 │     1 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          7 │ 2022-12-19 00:00:00 │ 2023-01-18 00:00:00 │ 2023-01-20 00:00:00 │     2 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          8 │ 2023-01-01 00:00:00 │ 2023-01-31 00:00:00 │ 2023-02-20 00:00:00 │    20 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│          9 │ 2023-01-22 00:00:00 │ 2023-02-21 00:00:00 │ 2023-06-01 00:00:00 │   100 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│         10 │ 2023-02-28 00:00:00 │ 2023-03-30 00:00:00 │ 2023-04-02 00:00:00 │     3 │ … │ NULL                │ NULL                │  NULL │   NULL │ NULL    │
│         11 │ 2023-05-14 00:00:00 │ 2023-06-13 00:00:00 │ 2023-09-01 00:00:00 │    80 │ … │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │    300 │ A       │
│         12 │ 2023-06-18 00:00:00 │ 2023-07-18 00:00:00 │ 2023-10-01 00:00:00 │    75 │ … │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │    300 │ A       │
│         13 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │ … │ 2023-02-21 00:00:00 │ 2023-06-01 00:00:00 │   100 │    280 │ B       │
│         13 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │ … │ 2022-11-23 00:00:00 │ 2023-03-03 00:00:00 │   100 │    150 │ B       │
│         13 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │ … │ 2022-11-21 00:00:00 │ 2023-03-01 00:00:00 │   100 │    111 │ B       │
├────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴───┴─────────────────────┴─────────────────────┴───────┴────────┴─────────┤
│ 15 rows                                                                                                                                   14 columns (10 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We can then:

  • GROUP BY the row_number column
  • select the first of each LHS column (i.e. self.*)
  • count the number of (non-null) RHS row_number values (other.row_number)
duckdb.sql("""
select
   first(
      columns(self.*)
   ),
   count(other.row_number)
from 
   db self left join db other 
on
   self.group = other.group
   and
   self.row_number > other.row_number 
   and
   self."Invoice Date" > other."Date Closed"
   and
   other.days > 90
group by
   self.row_number
order by 
   self.row_number
""")
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬────────┬─────────┬─────────────────────────┐
│ row_number │    Invoice Date     │      Due Date       │     Date Closed     │ days  │ amount │  group  │ count(other.row_number) │
│   int64    │      timestamp      │      timestamp      │      timestamp      │ int64 │ int64  │ varchar │          int64          │
├────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────┼────────┼─────────┼─────────────────────────┤
│          1 │ 2022-10-22 00:00:00 │ 2022-11-21 00:00:00 │ 2023-03-01 00:00:00 │   100 │    111 │ B       │                       0 │
│          2 │ 2022-10-24 00:00:00 │ 2022-11-23 00:00:00 │ 2023-03-03 00:00:00 │   100 │    150 │ B       │                       0 │
│          3 │ 2022-10-31 00:00:00 │ 2022-11-30 00:00:00 │ 2023-03-05 00:00:00 │    95 │    300 │ A       │                       0 │
│          4 │ 2022-11-10 00:00:00 │ 2022-12-10 00:00:00 │ 2023-02-02 00:00:00 │    54 │    180 │ A       │                       0 │
│          5 │ 2022-11-21 00:00:00 │ 2022-12-21 00:00:00 │ 2023-02-04 00:00:00 │    45 │    220 │ B       │                       0 │
│          6 │ 2022-12-04 00:00:00 │ 2023-01-03 00:00:00 │ 2023-01-04 00:00:00 │     1 │    210 │ B       │                       0 │
│          7 │ 2022-12-19 00:00:00 │ 2023-01-18 00:00:00 │ 2023-01-20 00:00:00 │     2 │    100 │ A       │                       0 │
│          8 │ 2023-01-01 00:00:00 │ 2023-01-31 00:00:00 │ 2023-02-20 00:00:00 │    20 │    200 │ B       │                       0 │
│          9 │ 2023-01-22 00:00:00 │ 2023-02-21 00:00:00 │ 2023-06-01 00:00:00 │   100 │    280 │ B       │                       0 │
│         10 │ 2023-02-28 00:00:00 │ 2023-03-30 00:00:00 │ 2023-04-02 00:00:00 │     3 │    250 │ A       │                       0 │
│         11 │ 2023-05-14 00:00:00 │ 2023-06-13 00:00:00 │ 2023-09-01 00:00:00 │    80 │     21 │ A       │                       1 │
│         12 │ 2023-06-18 00:00:00 │ 2023-07-18 00:00:00 │ 2023-10-01 00:00:00 │    75 │    456 │ A       │                       1 │
│         13 │ 2023-07-02 00:00:00 │ 2023-08-01 00:00:00 │ 2023-08-01 00:00:00 │     0 │    320 │ B       │                       3 │
├────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴────────┴─────────┴─────────────────────────┤
│ 13 rows                                                                                                                 8 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘