I have a query as follows:
SELECT
e.event_id,
(
SELECT
event_id
FROM atomic.events
WHERE
domain_userid = e.domain_userid
ORDER BY collector_tstamp
LIMIT 1
) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1
I'm trying to find the first event for each user. This is fairly fast ~5s.
If I'm trying to find the user by user_ipaddress
instead of domain_userid
it's way slower. After 300s it's not done.
SELECT
e.event_id,
(
SELECT
event_id
FROM atomic.events
WHERE
user_ipaddress = e.user_ipaddress
ORDER BY collector_tstamp
LIMIT 1
) AS parent_event_id
FROM snowplow_intermediary.events_enriched e
LIMIT 1
The data type is domain_userid varchar(36) encode runlength
and user_ipaddress varchar(45) encode runlength
.
Here's EXPLAIN's for the queries:
https://gist.github.com/mortenstarfly/4ce3be9b3a19aac2601a
https://gist.github.com/mortenstarfly/2008b0f737259df30695
I really would like to speed up the second query. Any suggestions?
It could be because of your short key..if your data is sorted based on user id then data will be retrieved fast(for the first query) as redshifts will know in which segment your data is residing (based on zone map) and can skip lot of slices and your io will be significantly low.