Group by sharding: either cardinality is too high or queries timeout

410 Views Asked by At

I'm doing GROUP EACH BY and JOIN EACH queries on huge tables. Because the cardinality of the grouping keys is too high, I 'shard' the query like this: (simplified example)

SELECT key FROM 
  (SELECT key FROM [table] WHERE ABS(HASH(key) % 2 = 0),
  (SELECT key FROM [table] WHERE ABS(HASH(key)) % 2 = 1)

Now the problem is that I have tables that require so many shards to keep the grouping key from being too large, that the overall query becomes too slow and results in a timeout.

I can probably solves this by running all the shards as separate queries and storing the intermediate results in temporary table. But I would really like to solve this without having to create additional tables (which would lead to additional costs).

Any suggestions?

Here's one of the queries that has this problem (I use coffeescript to produce the sharded queries)

"SELECT * FROM " +
(
    """
    (SELECT
        key, NEST(things) as things, FIRST(category) as category
    FROM
        (SELECT
            things, key, category, events,
            RATIO_TO_REPORT(events) OVER (PARTITION BY key) AS presence
        FROM (
            SELECT
                a.key as key, a.category as category
                a.things as things, a.events as events
            FROM [table1] a
            JOIN EACH (
                SELECT key FROM [table2]
                WHERE things BETWEEN 2 AND 10 AND ABS(HASH(key)) % #{shards} = #{shard}
            ) b
            ON a.key = b.key
            WHERE ABS(HASH(a.key)) % #{shards} = #{shard}))
    WHERE presence > 0.1 AND ABS(HASH(key)) % #{shards} = #{shard}
    GROUP EACH BY key
    HAVING COUNT(things) > 1)
    """ for shard in [0..(shards-1)]
).join ','
0

There are 0 best solutions below