I have a table containing the following fields:
email - the logged-in user's email
allowed_id - an ID of another user
The table contains multiple entries for the same email, each one containing a different allowed_id.
I'm trying to aggregate this into an array so I can save it to Redis and speed up one of our internal processes.
Usually I'd use ARRAY_AGG, but that isn't available in Redshift. Redshift has a LISTAGG function that works somewhat similarly, but it turns everything into a string and has a 64k length limit, which I've already hit in my first tries. When moving this to production I'll face an even larger dataset.
It's worth noting that query time isn't really important here; this will run as a cron job every day around 2:00 AM.
I've been trying to use the ARRAY function, but it returns something like:
email, [id]
same_email, [another_id]
And this is not what I'm looking for.
This is my query:
SELECT
email,
ARRAY(allowed_id) AS user_ids
FROM
sec_table
GROUP BY
email, allowed_id;
Just to make it clearer, this is the type of result I'm trying to achieve:
email, [id1, id2, id3]
I believe the 64k LISTAGG limit is just that - a hard limit.
See: how to handle Listagg size limit in redshift? (NB: adjust the 10000 used below to suit your data)
Following this approach you would arrive at fewer rows per email, some of which would still need further stitching together (perhaps in Python? I'm not sure whether that sidesteps the size issue).
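A rough, untested sketch of that chunking idea (table and column names follow the question; the 10000 chunk size is arbitrary and the `::varchar` cast assumes allowed_id is numeric): split each email's ids into chunks of at most 10000 rows, LISTAGG each chunk, and parse the result into a SUPER array, leaving a handful of rows per email to stitch together afterwards:

```sql
-- Untested sketch: break each email's ids into chunks so that no single
-- LISTAGG result approaches the 64k limit, then parse each chunk into
-- a SUPER array with JSON_PARSE.
SELECT email,
       JSON_PARSE('[' || LISTAGG(allowed_id::varchar, ',')
                         WITHIN GROUP (ORDER BY allowed_id) || ']') AS user_ids_chunk
FROM (
    SELECT email,
           allowed_id,
           (ROW_NUMBER() OVER (PARTITION BY email ORDER BY allowed_id) - 1) / 10000 AS chunk
    FROM sec_table
)
GROUP BY email, chunk;
```

This yields one row per (email, chunk) rather than one row per email, which is where the stitching step comes in.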
Alternatively - and I almost never suggest this - try a procedural approach
Create a summary table with a SUPER column, e.g.:
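Something along these lines (the table and column names are just placeholders):

```sql
-- Summary table: one row per email, ids held in a SUPER array.
CREATE TABLE email_user_ids (
    email    VARCHAR(320) NOT NULL,
    user_ids SUPER
);
```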
Now use a stored procedure to populate that table, e.g.:
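An untested sketch of such a procedure (all names are placeholders): seed the summary table with one SUPER array per chunk of up to 10000 ids, then repeatedly merge pairs of rows per email with ARRAY_CONCAT until each email has a single row:

```sql
CREATE OR REPLACE PROCEDURE build_email_user_ids()
AS $$
DECLARE
    remaining INT;
BEGIN
    TRUNCATE email_user_ids;

    -- Seed: one SUPER array per (email, chunk of up to 10000 ids),
    -- keeping every LISTAGG result well under the 64k limit.
    INSERT INTO email_user_ids (email, user_ids)
    SELECT email,
           JSON_PARSE('[' || LISTAGG(allowed_id::varchar, ',')
                             WITHIN GROUP (ORDER BY allowed_id) || ']')
    FROM (
        SELECT email,
               allowed_id,
               (ROW_NUMBER() OVER (PARTITION BY email ORDER BY allowed_id) - 1) / 10000 AS chunk
        FROM sec_table
    )
    GROUP BY email, chunk;

    SELECT INTO remaining MAX(cnt)
    FROM (SELECT COUNT(*) AS cnt FROM email_user_ids GROUP BY email);

    -- Merge pairs of chunk rows per email until one row per email remains.
    WHILE remaining > 1 LOOP
        CREATE TEMP TABLE merge_pass AS
        WITH numbered AS (
            SELECT email, user_ids,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY email) AS rn
            FROM email_user_ids
        )
        SELECT a.email,
               CASE WHEN b.user_ids IS NULL THEN a.user_ids
                    ELSE ARRAY_CONCAT(a.user_ids, b.user_ids)
               END AS user_ids
        FROM numbered a
        LEFT JOIN numbered b
          ON a.email = b.email AND b.rn = a.rn + 1
        WHERE a.rn % 2 = 1;

        TRUNCATE email_user_ids;
        INSERT INTO email_user_ids SELECT * FROM merge_pass;
        DROP TABLE merge_pass;

        SELECT INTO remaining MAX(cnt)
        FROM (SELECT COUNT(*) AS cnt FROM email_user_ids GROUP BY email);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Each pass halves the number of rows per email, so even an email with many chunks converges in a few iterations.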
Caveats: try this on a small scale first - what you see above is utterly untested and, even if it works, may work slowly. You then face the issue of getting that summary table out to Redis - that is possibly another question, and not something I'm trying to cover here.