I ran into a problem when consuming data from Kafka.
- There are more than 500 million records.
- If consumption starts at 9 a.m., the latest data obtained from Kafka is only up to about 2 a.m. Data that was sent after 2 a.m. exists, but it never gets consumed. (These times are just an example.)

This is the configuration in use:
```python
consume_from_kafka = (spark.readStream
    .format("kafka")
    .option("subscribe", subscribes)
    .option("kafka.bootstrap.servers", servers)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", configs)
    .option("failOnDataLoss", "false")
    .option("group.id", groups)
    .option("client.id", clients)
    .load()
)
```
As for why we chose a batch-style process: we only want to update the data twice a day. We are currently using Databricks with Spark and Delta Live Tables for the whole pipeline.

When consumption starts at 9:00 a.m., data that was sent at 6:00 a.m. should be picked up as well.
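For context, here is a minimal sketch of how the write side of such a twice-daily, batch-style run is typically wired up; the trigger mode, checkpoint path, and target table below are assumptions for illustration, not part of my actual configuration:

```python
# Sketch of the write side for a batch-style run (assumed, for illustration).
# With an availableNow trigger, each run processes everything available in the
# topic at start time and then stops; progress is recorded in the checkpoint,
# so the next run continues from the last committed offsets.
(consume_from_kafka
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/kafka_ingest")  # hypothetical path
    .trigger(availableNow=True)      # requires Spark 3.3+ / a recent Databricks runtime
    .toTable("bronze_kafka_ingest")  # hypothetical target table
)
```

Note that `startingOffsets = "earliest"` only applies to the very first run; once a checkpoint exists, every subsequent run resumes from the offsets stored there.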