Apache Beam, KafkaIO at least once semantics

354 Views Asked by At

We are implementing a pilot that reads from Kafka and writes to BigQuery.

Simple pipeline:

  • KafkaIO.read
  • BigQueryIO.write

We switched off the auto-commit. And we are using commitOffsetsInFinalize()

Can this setup guarantee that message will appear at least once in BigQuery and will not be lost, provided that everything is ok on the BigQueryIO side?

In the documentation for commitOffsetsInFinalize() I've met the following:  

It helps with minimizing gaps or duplicate processing of records while restarting a pipeline from scratch

I'm curious what "gaps" here are being referred to?

If you consider the edge cases, is there a possibility of messages being skipped and not delivered to BQ?

1

There are 1 best solutions below

0
On

Committing the offset for Apache Kafka means that if you were to restart your pipeline, it would start at the position within the stream before you restarted. Dataflow does have a guarantee that data won't be dropped when writing to BigQuery. But, using a distributed system, there is always a potential for something to go wrong (for example, a GCP outage).