BigQuery Storage API row restrictions on clustered table: do they reduce cost?


I'm using the BigQuery Storage API Java client (from Scala) to read from a clustered table with four clustering fields, created with e.g.

bq mk [...] --table --clustering_fields f1,f2,f3,f4 mytable mytableschema.json

and the schema looks like the following

[
  {
    "name": "f1",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "f2",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "f3",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "f4",
    "type": "STRING",
    "mode": "REQUIRED"
  },

 other fields...
]

Now, if I execute a normal query like this:

SELECT * FROM dataset.mytable 
WHERE f1 IN ('a', 'b') AND f2 IS NULL AND f3 = 'x'

I can see that the reported "bytes processed" correctly covers only the data belonging to the clusters matched by (a, null, x) and (b, null, x).

If I instead export the data through the Storage API with the exact same row restriction, I appear to be charged for the whole table. The API doesn't report the bytes actually processed; it only exposes an estimate, and that estimate says the whole table is being billed, which is also what I see in the actual billing.

The Storage API is used as follows (it's actually wrapped in a ZIO stream, but this is the gist of it):

    val options =
      TableReadOptions
        .newBuilder()
        .setRowRestriction("f1 IN ('a', 'b') AND f2 IS NULL AND f3 = 'x'")
        .build()

    val readSessionBuilder =
      ReadSession
        .newBuilder()
        .setTable(tableName)
        .setDataFormat(DataFormat.AVRO)
        .setReadOptions(options)

    val readSessionRequestBuilder =
      CreateReadSessionRequest
        .newBuilder()
        .setParent(ProjectName)
        .setReadSession(readSessionBuilder)
        .setMaxStreamCount(1)

    val session = client.createReadSession(readSessionRequestBuilder.build())
    // ... read from session.getStreamsList
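
For reference, this is roughly how the row restriction string gets assembled from per-column filters. It's only a sketch: `Filter`, `In`, `Eq`, `IsNull` and `RowRestriction` are hypothetical names of my own, not part of the BigQuery client library, and the quoting assumes backslash escaping is accepted in string literals within a row restriction.

```scala
// Hypothetical filter model; none of these types come from the BigQuery client.
sealed trait Filter
final case class In(column: String, values: Seq[String]) extends Filter
final case class Eq(column: String, value: String) extends Filter
final case class IsNull(column: String) extends Filter

object RowRestriction {
  // Wrap a value in single quotes, escaping backslashes and single quotes.
  // Assumption: backslash escapes are valid inside the restriction string.
  private def quote(v: String): String =
    "'" + v.replace("\\", "\\\\").replace("'", "\\'") + "'"

  // Render a single filter as a predicate on one column.
  // Note: an In with no values would render as "IN ()", so callers
  // should not pass an empty list.
  private def renderOne(f: Filter): String = f match {
    case In(column, values) =>
      val list = values.map(quote).mkString(", ")
      column + " IN (" + list + ")"
    case Eq(column, value) =>
      column + " = " + quote(value)
    case IsNull(column) =>
      column + " IS NULL"
  }

  // Join all predicates with AND, in the order given.
  def render(filters: Seq[Filter]): String =
    filters.map(renderOne).mkString(" AND ")
}
```

The result of `RowRestriction.render(Seq(In("f1", Seq("a", "b")), IsNull("f2"), Eq("f3", "x")))` is the same string passed to `setRowRestriction` above, with the filters on the leading clustering fields f1, f2 and f3 in clustering order.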

Does the Storage API support cost reduction via row restrictions on clustered tables at all? I can't find any information about it anywhere.
