Optimizing Performance of Non-Recursive CTEs in BigQuery

44 Views Asked by At

I have a query like this.

WITH all_products AS (
  SELECT
    gtin,
    category,
    product_name,
    product_image,
    brand,
    manufacturer
  FROM `products`
),
client_products as (
  SELECT *
  FROM all_products
  WHERE
    client_id = "usdemoaccount" and is_client_product = true
),
competitor_products as (
  SELECT *
  FROM all_products
  WHERE
    client_id = "usdemoaccount" and is_client_product = false
);

Here computation for all_products happens twice because its reference twice in the below code. But we need not do it twice and save on compute if re-use the above results.

The reason for this from BQ documentation mentions, non recursive CTE's are not materialized.

BigQuery only materializes the results of recursive CTEs, but does not materialize the results of non-recursive CTEs inside the WITH clause. If a non-recursive CTE is referenced in multiple places in a query, then the CTE is executed once for each reference.

I'm exploring alternatives to address this issue. While I understand that using temporary tables is one option, I'm concerned about potential drawbacks such as increased storage costs and concurrency issues, especially when the same API is used by multiple users with different parameters.

What are some effective strategies or best practices for optimizing the performance of CTEs in BigQuery? Specifically, I'm interested in approaches that can help materialize non-recursive CTEs or improve query performance without resorting to temporary tables.

Even with temporary tables, if there is a option to automatically clean up those tables to avoid storage costs and handle concurrently out of the box, that should also be preferable.

2

There are 2 best solutions below

0
mjruttenberg On

One big factor in keeping costs down in BigQuery is to always explicitly specify the columns you want to select from, rather than using SELECT *.

Additionally, for runtime, limit the data to as few rows as possible, e.g. limit by date, and/or some other WHERE condition, so any later queries or CTEs only work with the smallest required number or rows.

Cost vs runtime

In the query provided, there are no additional costs as nothing is materialised. The original query only from products only runs once, so only 1 set of charges is incurred, even if you reference the first CTE twice later, so that isn't a cost or storage concern, only a runtime one.

Reducing runtime

To reduce the runtime, you can set the flag for client or competitor in the first (and only) CTE, and then call it as necessary using the derived field client_competitor_product. This replaces working out which CTE you want to use - client_product or competitor_product.

Reducing data parsed

Additionally, if the condition is the same in both clauses (client_id = "usdemoaccount"), move that to a WHERE clause so it limits the dataset it is working with before evaluating the CASE statement.

Notes on the original query from the OP

In the 2nd and 3rd CTEs, 2x referenced fields are not present in all_products, so they would fail - client_id and is_client_product are not present in all_products.

There is no output SELECT for the CTE, so which CTE could be referenced? Sometimes it is client_products and sometimes competitor_products. If you UNION ALL them, you still can't tell which product belongs to which source, without an additional field, e.g. source, to distinguish them.

Limit the fields selected, the data evaluated and derive the field/s required

WITH all_products AS (
  SELECT
    gtin,
    category,
    product_name,
    product_image,
    brand,
    manufacturer,
    CASE
      WHEN is_client_product = true THEN "client"
      WHEN is_client_product = false THEN "competitor"
    END AS client_competitor_product
  FROM `products`
  WHERE client_id = "usdemoaccount"
)

SELECT <limit fields>
FROM all_products
WHERE <some condition>;
0
Ashutosh Sahoo On

I believe the below two approaches can help you. Please let me know if you want to discuss more on this:

  1. Create simple logical views, crating logical views will not consume any space.
  2. Create materialized views.
  3. If you are loading the table once in a day , then a create a table whic can be used by multiple users.