How to allow LISTAGG overflow into new record in Snowflake

82 Views Asked by At

I want to aggregate a list of strings together, but I know this list of strings will exceed the limit. This set of strings has an identifier associated with it. So a table might look like this

id | my_strings
______________
x1 | string{1}
x1 | string{2}
.
.
.
x1 | string{N}
x2 | string{1}

What I would love for a result is the following:

id | all_strings
------------------
x1 | string{1},...,string{M}
x1 | string{M+1},...,string{N}
x2 | string{1}

So essentially, if all the strings from 1 to N for the id x1 exceed the limit, it cuts it off at the largest possible M and then makes a new record also with x1 as the id where the "all_strings" column is the rest.

Is there anyway in standard Snowflake SQL code to achieve this? I know that the strings are never identical to one another and but the length of the strings is constant.

Any help would be greatly appreciated.

2

There are 2 best solutions below

1
On BEST ANSWER

It is possible to add subgroup to determine max number of elements per group:

-- max 5 elements
WITH cte AS (
  SELECT *,  CEIL(ROW_NUMBER() OVER(PARTITION BY ID ORDER BY str) / 5) AS grp
  FROM t
)
SELECT ID, LISTAGG(str, ',') WITHIN GROUP (ORDER BY str) AS all_strings
FROM cte
GROUP BY ID, grp
ORDER BY ID, all_strings;

For input:

CREATE TABLE  t(id INT, str TEXT);

INSERT INTO t(id, str)
VALUES (1,'string01'),(1,'string02'),(1,'string03'),(1,'string04'),(1,'string05'),
(1,'string06'),(1,'string07'),(1,'string08'),(1,'string09'),(1,'string10'),(1,'string11'),
(2,'string12'),(2,'string13'),(2,'string14'),(2,'string15'),(2,'string16'),
(2,'string17'),(2,'string18'),(2,'string19');

Output:

ID  ALL_STRINGS
1   string01,string02,string03,string04,string05
1   string06,string07,string08,string09,string10
1   string11
2   string12,string13,string14,string15,string16
2   string17,string18,string19
2
On

using a Javascript user defined table function:

Change the 40 to the cutoff you want.

CREATE OR REPLACE FUNCTION "CHAR_SUM"(INS STRING)
    RETURNS TABLE (BUCKET FLOAT)
    LANGUAGE JAVASCRIPT
    AS '{
    processRow: function (row, rowWriter, context) {
      var nl = row.INS.length > 0 ? 
            (row.INS.length + (this.sum > 0 ? 1: 0)): 0;
      
      if(this.sum + nl > 40){ 
        this.bucket = this.bucket + 1;
        this.sum = 0;
      }
      
      this.sum = this.sum + nl;
      rowWriter.writeRow({BUCKET: this.bucket});
      
    },
    initialize: function(argumentInfo, context) {
     this.bucket = 0;
     this.sum = 0;
    }}';

then we can:

with data as ( 
    select 
        x.id,
        seq8() as seq,
        'string' || seq::text as str
    from table(generator(rowcount=>10))
    cross join (values ('x1'),('x2')) as x(id)
)
select *
from data, TABLE(char_sum(str) over (partition by id order by seq));

enter image description here

now we can to the aggregation, like so:

with data as ( 
    select 
        x.id,
        seq8() as seq,
        'string' || seq::text as str
    from table(generator(rowcount=>10))
    cross join (values ('x1'),('x2')) as x(id)
)
select 
    id,
    listagg(str, ',') within group (order by seq) as batch,
    length(batch) as qa
from data, TABLE(char_sum(str) over (partition by id order by seq))
group by 1, bucket
order by 1, bucket;

enter image description here