BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which reads messages from Pub/Sub Lite and streams the data into a BigQuery table. The table is partitioned by day. When querying the table with:

SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"

The BigQuery UI tells me "This query will process 1.9 GB when run", but when actually running the query I don't get any results. My pipeline has been running for a whole week now, and I have been getting the same empty result for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.

I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:

return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);

Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me that there is actually data available?


1 Answer

BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned table.

Check more details in this related Stack Overflow question and also in the documentation available here.

When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
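
If you want to check whether rows are still sitting in the streaming buffer, a rough command-line check could look like the sketch below, using the table name from your question. Note that the _PARTITIONTIME pseudo column only exists on ingestion-time partitioned tables; your table is partitioned on the timestamp column, so for it only the first check applies.

# Show the table metadata; a populated "streamingBuffer" section
# (estimatedRows, estimatedBytes, oldestEntryTime) means rows are still
# waiting to be committed to partitions.
bq show --format=prettyjson my-project:my-dataset.my-table

# On an ingestion-time partitioned table you could also count the buffered
# rows directly, since their _PARTITIONTIME is still NULL:
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS buffered_rows FROM `my-project.my-dataset.my-table` WHERE _PARTITIONTIME IS NULL'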

If you are having problems writing the data from Pub/Sub into BigQuery, I recommend you use one of the templates available in Dataflow.

Use a Dataflow template available in GCP to write the data from Pub/Sub to BigQuery:

There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.

I tested it as follows and it works perfectly:

  • Create a subscription on your Pub/Sub topic;
  • Create a bucket for temporary storage;
  • Create the job from the Pub/Sub to BigQuery template in the Dataflow console (a sketch of the equivalent gcloud command is shown below);
  • For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:

gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
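
For reference, creating the job in step 3 from the command line would look roughly like the sketch below. It assumes the public "Pub/Sub Subscription to BigQuery" classic template; the job name, region, bucket, subscription and output table are placeholders you would replace with your own.

# Launch the classic "Pub/Sub Subscription to BigQuery" template
# (all resource names below are placeholders).
gcloud dataflow jobs run pubsub-to-bq-test \
    --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --region us-central1 \
    --staging-location gs://my-temp-bucket/temp \
    --parameters inputSubscription=projects/my-project/subscriptions/test-topic-sub,outputTableSpec=my-project:my-dataset.output_table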

If you want something more complex, you can fork the templates' code from GitHub and adjust it to your needs.
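
A rough sketch of that route is shown below; the repository is the public GoogleCloudPlatform/DataflowTemplates repo, and the Maven main class and flags follow its classic (v1) template staging instructions, so check the repo's README for the exact current command.

# Clone the public templates repository and stage your own copy of the
# classic Pub/Sub to BigQuery template (bucket paths are placeholders).
git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates

mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
    -Dexec.args="--project=my-project \
    --stagingLocation=gs://my-temp-bucket/staging \
    --tempLocation=gs://my-temp-bucket/temp \
    --templateLocation=gs://my-temp-bucket/templates/PubSubToBigQuery.json \
    --runner=DataflowRunner"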