I ran into a problem when consuming data from Kafka.
- There are more than 500 million records.
- If consumption starts at 9 a.m., the latest data obtained from Kafka is only up to about 2 a.m. Data that was sent after 2 a.m. exists, but it never gets consumed. (These times are just an example.)

This is the configuration in use:
```python
consume_from_kafka = (spark.readStream
    .format("kafka")
    .option("subscribe", subscribes)
    .option("kafka.bootstrap.servers", servers)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", configs)
    .option("failOnDataLoss", "false")
    .option("group.id", groups)
    .option("client.id", clients)
    .load()
)
```
As for why we chose a batch-style process: we only want to update the data twice a day. We are currently using Databricks with Spark and Delta Live Tables for the whole pipeline.

When consumption starts at 9:00 a.m., data that was sent at 6:00 a.m. should be picked up as well.
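For context, here is a minimal sketch of how the write side of such a twice-daily, batch-style run is typically wired up; the trigger mode, checkpoint path, and target table below are assumptions for illustration, not part of my actual configuration:

```python
# Sketch of the write side for a batch-style run (assumed, for illustration).
# With an availableNow trigger, each run processes everything available in the
# topic at start time and then stops; progress is recorded in the checkpoint,
# so the next run continues from the last committed offsets.
(consume_from_kafka
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/kafka_ingest")  # hypothetical path
    .trigger(availableNow=True)      # requires Spark 3.3+ / a recent Databricks runtime
    .toTable("bronze_kafka_ingest")  # hypothetical target table
)
```

Note that `startingOffsets = "earliest"` only applies to the very first run; once a checkpoint exists, every subsequent run resumes from the offsets stored there.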