Spark Kafka: understanding offset management with enable.auto.commit


According to the Kafka documentation, offsets in Kafka can be managed using enable.auto.commit and auto.commit.interval.ms. I have difficulty understanding the concept.
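To make the two settings concrete, this is roughly how they would look on a plain (non-Spark) Kafka consumer. Just a sketch using the kafka-python client; the broker address, topic, and group id are placeholders:

from kafka import KafkaConsumer

# Placeholder broker/topic/group values for illustration only.
consumer = KafkaConsumer(
    "my_topic",
    bootstrap_servers="localhost:9092",
    group_id="daily-batch-loader",
    enable_auto_commit=True,        # commit offsets automatically in the background
    auto_commit_interval_ms=5000,   # ...roughly every 5 seconds while poll() is being called
    auto_offset_reset="earliest",   # where to start when no committed offset exists yet
)

for record in consumer:
    print(record.partition, record.offset, record.value)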

For example, I have a Kafka topic that is batch loaded once a day and should only load the new entries since the last run. How do I configure both parameters, and what are the pros and cons of automatic offset management?

This site: https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/ states that the interval is normally set to 5 seconds. Does this mean the latest offset is committed 5 seconds after the consumer ran, or is the stored offset generally updated every 5 seconds regardless of the last run? If Kafka stores the offset itself, how does the retrieval process work?
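As far as I understand the retrieval side, the broker stores the committed offset per consumer group and partition, and a client can read it back. A rough sketch with kafka-python, again with placeholder names:

from kafka import KafkaConsumer, TopicPartition

# Placeholder values for illustration only.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="daily-batch-loader",
    enable_auto_commit=False,
)

tp = TopicPartition("my_topic", 0)
committed = consumer.committed(tp)         # last offset committed by this group, or None
log_end = consumer.end_offsets([tp])[tp]   # current end of the partition
print(f"committed={committed}, log_end={log_end}")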

There is also the starting offset parameter startingOffsets. How can I retrieve the last auto-committed offset? Currently I understand that for batch reads it can only be set to earliest or to manually specified offsets.
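For reference, the manual form is a JSON string with explicit per-partition offsets. A sketch of how the two variables used in the code below could be filled in (the topic name and offset numbers are made up):

# In the JSON form, -2 means "earliest" and -1 means "latest";
# "latest" is not allowed as a starting offset for batch queries.
parameter_offset_start = """{"my_topic": {"0": 1500, "1": 1230}}"""
parameter_offset_end = """{"my_topic": {"0": -1, "1": -1}}"""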

Edit: Added Code

spark.sparkContext.setCheckpointDir("directory")

df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_server) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .option("kafka.group.id", parameter_group)\
    .option("startingOffsets", parameter_offset_start) \
    .option("endingOffsets", parameter_offset_end) \
    .option("subscribe", topic) \
    .load()
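What I am considering for the daily load (just a sketch, not verified) is to compute the highest offset per partition from each run, persist it, and feed it back as startingOffsets on the next run, since as far as I can tell the batch read does not commit offsets back to Kafka. The output path is a placeholder:

import json
from pyspark.sql import functions as F

# Highest offset read per partition in this run.
# Note: partitions with no new records in this run would be missing here.
max_offsets = (df.groupBy("partition")
                 .agg(F.max("offset").alias("max_offset"))
                 .collect())

# Assuming startingOffsets is inclusive, the next run should begin
# one past the last offset that was read in this run.
next_start = {topic: {str(r["partition"]): r["max_offset"] + 1 for r in max_offsets}}

with open("/tmp/last_offsets.json", "w") as f:
    json.dump(next_start, f)

# The next day's job would read this file back and pass json.dumps(next_start)
# as the startingOffsets option.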