Specifying a checkpoint location when streaming data from Kafka topics with Spark Structured Streaming


I have built a Spark Structured Streaming application that reads data from Kafka topics, with startingOffsets set to latest. If there is a failure on the Spark side, from which offset will the application resume reading after a restart? Is it a good idea to specify a checkpoint location on the write stream so that reading resumes from the point where the application failed? Please let me know.


There are 2 best solutions below

OneCricketeer On

Yes, you can use checkpoints, or you can set kafka.group.id (in Spark 3+, at least).

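Otherwise, with startingOffsets set to latest, a restarted query may start back at the end of the topic and skip any records that arrived while it was down.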

Christos Natsis On

I would advise you to set startingOffsets to earliest and configure a checkpointLocation (on HDFS, MinIO, or another durable store). Setting kafka.group.id will not make Spark commit offsets back to Kafka (even in Spark 3+) unless you commit them manually using foreachBatch.
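
To illustrate, here is a minimal PySpark sketch of that setup. It is not taken from the asker's application: the function names, the Parquet sink, and every server, topic, and path value are placeholders I made up. It assumes the spark-sql-kafka connector is on the classpath. Note that startingOffsets only applies the first time a query starts; once a checkpoint exists, a restarted query resumes from the offsets recorded there.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build options for the Kafka source (hypothetical helper).

    startingOffsets is only consulted when the query starts with an
    empty checkpoint; after that the checkpointed offsets are used.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def start_query(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a Kafka-to-Parquet streaming query with a checkpoint.

    `spark` is an existing SparkSession. The Parquet sink is only an
    example; any sink can be given a checkpointLocation the same way.
    """
    source = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Kafka keys and values arrive as binary, so cast them to strings.
    records = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )

    return (
        records.writeStream
        .format("parquet")
        .option("path", output_path)
        # Durable location (HDFS, S3/MinIO, ...) where Spark records the
        # offsets of each micro-batch. Reuse the same path on restart.
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

You would call it with something like start_query(spark, "localhost:9092", "events", "/data/events", "/checkpoints/events") and then awaitTermination() on the returned query. Keep the checkpoint path stable across restarts; pointing the query at a new, empty path makes it fall back to startingOffsets again.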