Redshift slow sub query (if it includes certain columns)

609 Views Asked by At

I have a query as follows:

SELECT
    e.event_id,
    (
        SELECT
            event_id
        FROM atomic.events
        WHERE
            domain_userid = e.domain_userid
        ORDER BY collector_tstamp
        LIMIT 1
    ) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1

I'm trying to find the first event for each user. This is fairly fast ~5s. If I'm trying to find the user by user_ipaddress instead of domain_userid it's way slower. After 300s it's not done.

SELECT
    e.event_id,
    (
        SELECT
            event_id
        FROM atomic.events
        WHERE
            user_ipaddress = e.user_ipaddress
        ORDER BY collector_tstamp
        LIMIT 1
    ) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1

The data type is domain_userid varchar(36) encode runlength and user_ipaddress varchar(45) encode runlength.

Here's EXPLAIN's for the queries:

https://gist.github.com/mortenstarfly/4ce3be9b3a19aac2601a

https://gist.github.com/mortenstarfly/2008b0f737259df30695

I really would like to speed up the second query. Any suggestions?

1

There are 1 best solutions below

0
On

It could be because of your short key..if your data is sorted based on user id then data will be retrieved fast(for the first query) as redshifts will know in which segment your data is residing (based on zone map) and can skip lot of slices and your io will be significantly low.