Specifying a checkpoint location when streaming data from Kafka topics with Spark Structured Streaming

515 Views Asked by swetha k At 16 September 2022 at 03:45

I have built a Spark Structured Streaming application that reads data from Kafka topics, with the starting offsets set to latest. What happens if there is a failure on the Spark side? From which point/offset will the application continue reading after a restart? And is it a good idea to specify a checkpoint in the write stream, to make sure we resume from the point where the application/Spark failed? Please let me know.


There are 2 best solutions below

OneCricketeer On 16 September 2022 at 13:41

You can use checkpoints, yes, or you can set kafka.group.id (in Spark 3+, at least).

Otherwise, after a restart the query may start back at the end of the topic.

Christos Natsis On 20 October 2022 at 14:15

I would advise you to set startingOffsets to earliest and configure a checkpointLocation (on HDFS, MinIO, or other storage). Setting kafka.group.id will not commit offsets back to Kafka (even in Spark 3+), unless you commit them manually using foreachBatch.
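As a rough sketch of that setup in PySpark: the broker address, topic name, output path, checkpoint path, and the helper names kafka_source_options and start_query are all placeholders I made up for illustration, and the Parquet sink is just one possible choice.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only consulted when a new query starts; a restarted query
        # resumes from the offsets recorded in its checkpoint.
        "startingOffsets": starting_offsets,
    }


def start_query(spark, source_options, output_path, checkpoint_dir):
    """Start a Kafka-to-Parquet query that tracks progress in checkpoint_dir."""
    stream = spark.readStream.format("kafka").options(**source_options).load()
    return (
        stream.selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
        )
        .writeStream.format("parquet")
        .option("path", output_path)
        # Use the same directory on every restart of this query.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Example call (assumed paths; needs a SparkSession with the
# spark-sql-kafka connector on the classpath):
# query = start_query(
#     spark,
#     kafka_source_options("localhost:9092", "events"),
#     "/data/events",
#     "/checkpoints/events",
# )
```

The checkpoint directory has to stay the same across restarts and live on storage that survives the failure; pointing the query at a new or empty directory makes it start from startingOffsets again.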
