Propagation of BigQuery parition/cluster keys to a CTE table - Performance

826 Views Asked by At

I setup a persistent table in our BigQuery database (using Looker, if that's relevant). The table has both a partition_key and a few cluster_keys. I partition on time, then cluster on my primary key (made with GENERATE_UUID), plus the two major fields that users will search on.

I then have a CTE table that the rest of the queries pull data from. This CTE selects a subset of the persistent table (the one with the partition and cluster keys), but this table is not itself persistent, so I don't think I can include partition and cluster keys in it. It looks like this:

WITH my_table_pre_exclusion AS (--
        SELECT
            *
        FROM
            `server.data.prefix_my_table_persist`
        WHERE
        (
                      ( -- Some filter here
                ) -- AND ... some filter here
        )
    )

My question is: does pulling from this table (which pre-applies a bunch of filters) hurt performance when I later do a ton of joins involving fields that ARE in the partition or cluster key fields?

Would it be more performant to skip this CTE table, pull directly from the persistent table in all my downstream joins, and then re-apply the filters (which apply to everything downstream)? It would be a lot more code bloat, but I did some benchmarking, and I thinkkkk it's hurting performance, but I'm not really sure.

Is there a "best of both worlds" approach where I don't have to apply the same filters to a ton of downstream tables, but I still get optimal performance? Maybe inner join my_table_pre_exclusion to all the downstream tables after-the-fact?

1

There are 1 best solutions below

0
On

Posting my own answer to this, though I'd be happy for anybody else to elaborate, as I could only find very sparse documentation on this.

I was able to get some info from a helpful BigQuery expert: what I'm asking about is something called "Predicate Pushdown", which BigQuery recently added support for.

I'm still trying to read up on the details of the support, but this does not appear to be something unique to only BigQuery (although I'm sure its optimizers play a huge role in overall performance). You can read a little about it here: https://modern-sql.com/feature/with/performance#predicate-pushdown

The bottom line is that if BigQuery's support is sufficient for the query I'm running, then queries-on-subqueries will be efficiently executed using the partition/cluster keys. I read some docs from the initial release, anyways, saying that it might only work with the date-based partition key, but maybe it has since expanded support. The general topic of "