BigQuery streaming buffer taking too long (90 minutes each time)

I am streaming data from Pub/Sub to BigQuery with a Dataflow template, using the code in [1]. Data is typically visible in the table preview after about 40 seconds, but most of the time it only becomes queryable after 90 minutes (as in the documentation). I have also tried the STREAMING_INSERTS and STORAGE_WRITE_API methods, with the same latency.

But when using the BigQuery Storage Write API directly [2], data stays in the streaming buffer for only 3-4 minutes and is then available for querying. Those rows even enter the buffer and become queryable before the rows from case [1].

This makes me think the default behaviour of the Storage Write API is what I need to make streaming faster, and that there should be some setting in Apache Beam that makes it behave the same way as direct usage of the Storage Write API.

Any ideas what could be tweaked in this pipeline [1], or in some other Dataflow setting, to make it work?

[1]

WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    // At-least-once semantics via the Storage Write API
                    .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
                    // 0 = no fixed number of write streams
                    .withNumStorageWriteApiStreams(0)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(
                        input ->
                            getTableDestination(
                                input, tableNameAttr, datasetNameAttr, outputTableProject)));
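
For reference, the STORAGE_WRITE_API variant I mentioned trying looked roughly like this (a minimal sketch; the 10-second triggering frequency and the 3 streams are placeholder values I picked, not tuned settings — this exactly-once method requires both to be set explicitly in streaming pipelines):

WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    // Exactly-once Storage Write API: appends are batched and
                    // committed on the triggering frequency below.
                    .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                    .withTriggeringFrequency(org.joda.time.Duration.standardSeconds(10)) // placeholder
                    .withNumStorageWriteApiStreams(3) // placeholder; must be > 0 for this method
                    .to(
                        input ->
                            getTableDestination(
                                input, tableNameAttr, datasetNameAttr, outputTableProject)));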

[2]

com.google.cloud.bigquery.storage.v1.BigQueryWriteClient
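
By direct usage I mean appending to the table's default write stream, roughly like this (a minimal sketch; the project/dataset/table names and the single STRING column are placeholders, not my real schema):

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class DefaultStreamWrite {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers -- substitute the real project/dataset/table.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    // Placeholder schema with a single STRING column.
    TableSchema schema =
        TableSchema.newBuilder()
            .addFields(
                TableFieldSchema.newBuilder()
                    .setName("message")
                    .setType(TableFieldSchema.Type.STRING)
                    .setMode(TableFieldSchema.Mode.NULLABLE)
                    .build())
            .build();

    // A JsonStreamWriter built with the parent table name appends to the
    // table's default stream.
    try (JsonStreamWriter writer =
        JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
      JSONArray rows = new JSONArray();
      rows.put(new JSONObject().put("message", "hello"));

      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get(); // block until the append is acknowledged
    }
  }
}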
