Bigquery SELECT query cost much higher than table size

133 Views Asked by At

I have a BigQuery table which is filled by a datastream from CloudSQL. Recently I tried partitioning the table by DAY to reduce query costs. This didn't work (partition pruning was not being applied), but I left the table partitioned since switching back would be annoying. This was 3 days ago.

The table is truncated and refilled every day. Currently it has 1.8 million rows, and a size of around 700 MB plus 20 MB streaming buffer (querying the __TABLES__ meta-table gives the same value as the GUI).

However, running SELECT * FROM table_name costs 3 GB, and according to the Execution Details, nearly 7 million rows are read.

It seems as though the truncated data is still being queried by BigQuery -- the size lines up for 4 days of data. This didn't happen when the table was not partitioned, so why might it be happening now?

Other information that might be useful

Logical Bytes: ~700 MB, Physical Bytes: ~700 MB, Time Travel Bytes: ~650 MB, Number of partitions: 560.

The (simplified) table schema is:

event_date            DATE
hour                  INTEGER
data_columns          VARIOUS
...
datastream_metadata   RECORD
    uuid              STRING
    source_timestamp  INTEGER

There are no other timestamp columns, and the event_date column ranges over a few months, so isn't useful for filtering. I thought the source_timestamp column might be useful, but it turns out that it just represents the last modified date from CloudSQL, which isn't helpful here since that date might be many days in the past.

Update 1

Strangely, using BigQuery's time travel feature to query a version of the table without the streaming buffer reduces the query cost back to 700 MB. However, partition pruning still is not applied. For instance:

This query costs 3 GB:

SELECT * FROM table_name
    WHERE event_date = '2023-10-29';

while this costs 700 MB (for any time interval which is older than the oldest record in the streaming buffer):

SELECT * FROM table_name
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    WHERE event_date = '2023-10-29';

Both return the same ~2000 rows, which, since the table is partitioned on event_date, should only cost a few hundred kilobytes.

This behaviour is inconsistent. The first time the time travel query is run, the actual cost will be 700 MB. Second and subsequent runs are billed for 3 GB.

0

There are 0 best solutions below