Spark Kafka: understanding offset management with enable.auto.commit


According to the Kafka documentation, offsets in Kafka can be managed using enable.auto.commit and auto.commit.interval.ms. I have difficulty understanding the concept.
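To make the two settings concrete, this is roughly how they would look on a plain (non-Spark) Kafka consumer. Just a sketch using the kafka-python client; the broker address, topic, and group id are placeholders:

from kafka import KafkaConsumer

# Placeholder broker/topic/group values for illustration only.
consumer = KafkaConsumer(
    "my_topic",
    bootstrap_servers="localhost:9092",
    group_id="daily-batch-loader",
    enable_auto_commit=True,        # commit offsets automatically in the background
    auto_commit_interval_ms=5000,   # ...roughly every 5 seconds while poll() is being called
    auto_offset_reset="earliest",   # where to start when no committed offset exists yet
)

for record in consumer:
    print(record.partition, record.offset, record.value)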

For example, I have a Kafka topic that is batch loaded once a day and should only load the new entries since the last run. How do I configure both parameters, and what are the pros and cons of automatic offset management?

This site: https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/ states that the interval is normally set to 5 seconds. Does this mean the latest offset is committed 5 seconds after the consumer ran, or is the stored offset generally updated every 5 seconds regardless of the last run? If Kafka stores the offset itself, how does the retrieval process work?
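As far as I understand the retrieval side, the broker stores the committed offset per consumer group and partition, and a client can read it back. A rough sketch with kafka-python, again with placeholder names:

from kafka import KafkaConsumer, TopicPartition

# Placeholder values for illustration only.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="daily-batch-loader",
    enable_auto_commit=False,
)

tp = TopicPartition("my_topic", 0)
committed = consumer.committed(tp)         # last offset committed by this group, or None
log_end = consumer.end_offsets([tp])[tp]   # current end of the partition
print(f"committed={committed}, log_end={log_end}")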

There is also the starting offset parameter startingOffsets. How can I retrieve the last auto-committed offset? Currently I understand that for batch reads it can only be set to earliest or to manually specified offsets.
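For reference, the manual form is a JSON string with explicit per-partition offsets. A sketch of how the two variables used in the code below could be filled in (the topic name and offset numbers are made up):

# In the JSON form, -2 means "earliest" and -1 means "latest";
# "latest" is not allowed as a starting offset for batch queries.
parameter_offset_start = """{"my_topic": {"0": 1500, "1": 1230}}"""
parameter_offset_end = """{"my_topic": {"0": -1, "1": -1}}"""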

Edit: Added Code

spark.sparkContext.setCheckpointDir("directory")

df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_server) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .option("kafka.group.id", parameter_group)\
    .option("startingOffsets", parameter_offset_start) \
    .option("endingOffsets", parameter_offset_end) \
    .option("subscribe", topic) \
    .load()
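What I am considering for the daily load (just a sketch, not verified) is to compute the highest offset per partition from each run, persist it, and feed it back as startingOffsets on the next run, since as far as I can tell the batch read does not commit offsets back to Kafka. The output path is a placeholder:

import json
from pyspark.sql import functions as F

# Highest offset read per partition in this run.
# Note: partitions with no new records in this run would be missing here.
max_offsets = (df.groupBy("partition")
                 .agg(F.max("offset").alias("max_offset"))
                 .collect())

# Assuming startingOffsets is inclusive, the next run should begin
# one past the last offset that was read in this run.
next_start = {topic: {str(r["partition"]): r["max_offset"] + 1 for r in max_offsets}}

with open("/tmp/last_offsets.json", "w") as f:
    json.dump(next_start, f)

# The next day's job would read this file back and pass json.dumps(next_start)
# as the startingOffsets option.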