I have a table containing the following fields:
email - the logged-in user's email
allowed_id - an ID of another user
The table contains multiple entries for the same email, each one containing a different allowed_id.
I'm trying to aggregate this into an array so I can save it to Redis and speed up one of our internal processes.
Usually I'd use ARRAY_AGG, but that isn't available in Redshift. Redshift has a LISTAGG function that works somewhat similarly, but it turns everything into a string and has a 64k length limit, which I've already hit in my first tries. When moving this to production I'll face an even larger dataset.
It's worth noting that query time isn't really important here; this will run as a cron job every day around 2:00 AM.
I've been trying to use the ARRAY function, but it returns something like:
email, [id]
same_email, [another_id]
And this is not what I'm looking for.
This is my query:
SELECT
email,
ARRAY(allowed_id) AS user_ids
FROM
sec_table
GROUP BY
email, allowed_id;
Just to make it clearer, this is the type of result I'm trying to achieve:
email, [id1, id2, id3]
I believe the 64k LISTAGG limit is just that - a hard limit.
See: how to handle Listagg size limit in redshift? (NB: adjust the 10000 used below to suit your data)
Following this approach you would arrive at fewer rows per email, some of which would still need further stitching together (perhaps in Python? I'm not sure whether that sidesteps the size issue).
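A rough, untested sketch of that chunking idea (table and column names follow the question; the 10000 chunk size is arbitrary and the `::varchar` cast assumes allowed_id is numeric): split each email's ids into chunks of at most 10000 rows, LISTAGG each chunk, and parse the result into a SUPER array, leaving a handful of rows per email to stitch together afterwards:

```sql
-- Untested sketch: break each email's ids into chunks so that no single
-- LISTAGG result approaches the 64k limit, then parse each chunk into
-- a SUPER array with JSON_PARSE.
SELECT email,
       JSON_PARSE('[' || LISTAGG(allowed_id::varchar, ',')
                         WITHIN GROUP (ORDER BY allowed_id) || ']') AS user_ids_chunk
FROM (
    SELECT email,
           allowed_id,
           (ROW_NUMBER() OVER (PARTITION BY email ORDER BY allowed_id) - 1) / 10000 AS chunk
    FROM sec_table
)
GROUP BY email, chunk;
```

This yields one row per (email, chunk) rather than one row per email, which is where the stitching step comes in.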
Alternatively - and I almost never suggest this - try a procedural approach
Create a summary table with a SUPER column, e.g.:
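Something along these lines (the table and column names are just placeholders):

```sql
-- Summary table: one row per email, ids held in a SUPER array.
CREATE TABLE email_user_ids (
    email    VARCHAR(320) NOT NULL,
    user_ids SUPER
);
```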
Now use a stored procedure to populate that table, e.g.:
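An untested sketch of such a procedure (all names are placeholders): seed the summary table with one SUPER array per chunk of up to 10000 ids, then repeatedly merge pairs of rows per email with ARRAY_CONCAT until each email has a single row:

```sql
CREATE OR REPLACE PROCEDURE build_email_user_ids()
AS $$
DECLARE
    remaining INT;
BEGIN
    TRUNCATE email_user_ids;

    -- Seed: one SUPER array per (email, chunk of up to 10000 ids),
    -- keeping every LISTAGG result well under the 64k limit.
    INSERT INTO email_user_ids (email, user_ids)
    SELECT email,
           JSON_PARSE('[' || LISTAGG(allowed_id::varchar, ',')
                             WITHIN GROUP (ORDER BY allowed_id) || ']')
    FROM (
        SELECT email,
               allowed_id,
               (ROW_NUMBER() OVER (PARTITION BY email ORDER BY allowed_id) - 1) / 10000 AS chunk
        FROM sec_table
    )
    GROUP BY email, chunk;

    SELECT INTO remaining MAX(cnt)
    FROM (SELECT COUNT(*) AS cnt FROM email_user_ids GROUP BY email);

    -- Merge pairs of chunk rows per email until one row per email remains.
    WHILE remaining > 1 LOOP
        CREATE TEMP TABLE merge_pass AS
        WITH numbered AS (
            SELECT email, user_ids,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY email) AS rn
            FROM email_user_ids
        )
        SELECT a.email,
               CASE WHEN b.user_ids IS NULL THEN a.user_ids
                    ELSE ARRAY_CONCAT(a.user_ids, b.user_ids)
               END AS user_ids
        FROM numbered a
        LEFT JOIN numbered b
          ON a.email = b.email AND b.rn = a.rn + 1
        WHERE a.rn % 2 = 1;

        TRUNCATE email_user_ids;
        INSERT INTO email_user_ids SELECT * FROM merge_pass;
        DROP TABLE merge_pass;

        SELECT INTO remaining MAX(cnt)
        FROM (SELECT COUNT(*) AS cnt FROM email_user_ids GROUP BY email);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Each pass halves the number of rows per email, so even an email with many chunks converges in a few iterations.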
Caveats: try this on a small scale first - what you see above is utterly untested and, even if it works, may work slowly. You then face the issue of getting that summary table out to Redis - that is possibly another question, and not something I'm trying to cover here.