I'm doing GROUP EACH BY and JOIN EACH queries on huge tables. Because the cardinality of the grouping keys is too high, I 'shard' the query like this: (simplified example)
SELECT key FROM
(SELECT key FROM [table] WHERE ABS(HASH(key) % 2 = 0),
(SELECT key FROM [table] WHERE ABS(HASH(key)) % 2 = 1)
Now the problem is that I have tables that require so many shards to keep the grouping key from being too large, that the overall query becomes too slow and results in a timeout.
I can probably solves this by running all the shards as separate queries and storing the intermediate results in temporary table. But I would really like to solve this without having to create additional tables (which would lead to additional costs).
Any suggestions?
Here's one of the queries that has this problem (I use coffeescript to produce the sharded queries)
"SELECT * FROM " +
(
"""
(SELECT
key, NEST(things) as things, FIRST(category) as category
FROM
(SELECT
things, key, category, events,
RATIO_TO_REPORT(events) OVER (PARTITION BY key) AS presence
FROM (
SELECT
a.key as key, a.category as category
a.things as things, a.events as events
FROM [table1] a
JOIN EACH (
SELECT key FROM [table2]
WHERE things BETWEEN 2 AND 10 AND ABS(HASH(key)) % #{shards} = #{shard}
) b
ON a.key = b.key
WHERE ABS(HASH(a.key)) % #{shards} = #{shard}))
WHERE presence > 0.1 AND ABS(HASH(key)) % #{shards} = #{shard}
GROUP EACH BY key
HAVING COUNT(things) > 1)
""" for shard in [0..(shards-1)]
).join ','