Consume data from Kafka using batch processing


I encountered a problem with consuming data from Kafka.

  1. There are more than 500 million records in the topic.
  2. If consumption starts at 9 a.m., the newest data obtained from Kafka was sent around 2 a.m. There is data that was sent after 2 a.m., but it is never consumed. As an example, this is the configuration in use:
consume_from_kafka = (spark.readStream
    .format("kafka")
    .option("subscribe", subscribes)              # topic name(s)
    .option("kafka.bootstrap.servers", servers)   # broker list
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", configs)    # SASL credentials
    .option("failOnDataLoss", "false")
    .option("group.id", groups)
    .option("client.id", clients)
    .load()
)

The reason we chose a batch process is that we only want to update the data twice a day. We are currently using Databricks with Spark and Delta Live Tables to do all the work.
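
The write side of the query is not shown above. Below is a minimal sketch of how such a twice-daily batch run could be wired, assuming a plain Structured Streaming job with an availableNow trigger rather than the DLT pipeline; checkpoint_path and target_table are illustrative placeholders, not part of the actual setup:

(consume_from_kafka
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)  # placeholder path for the streaming checkpoint
    .trigger(availableNow=True)                     # read everything up to the current latest offsets, then stop
    .toTable(target_table)                          # placeholder Delta table name
)

With Trigger.AvailableNow (or Trigger.Once on older runtimes), each scheduled run should drain all offsets that exist at the moment the run starts.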

When consumption starts at 9:00 a.m., data that was sent at 6:00 a.m. should be consumed as well.
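
To make that concrete, a one-off batch read (spark.read instead of spark.readStream) with endingOffsets set to latest can show whether records sent after 2 a.m. are visible to Spark at all. This is only an illustrative check, reusing the same placeholder variables as above:

from pyspark.sql import functions as F

latest_check = (spark.read
    .format("kafka")
    .option("subscribe", subscribes)
    .option("kafka.bootstrap.servers", servers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", configs)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")   # batch read up to the latest offsets at submit time
    .load()
)

# The newest Kafka timestamp shows whether data produced after 2 a.m. is present in the topic.
latest_check.select(F.max("timestamp").alias("newest_record")).show()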
