I have a Kafka stream of actual data that needs to be enriched with reference data.
The reference data can be fetched via an API call, and any updates to the reference data arrive on another Kafka topic.
I want to fetch the whole reference data set via the API on startup and store it in Flink state, then update that state from the Kafka reference-data topic.
How can this state be shared across task slots in multiple task managers, so that I can connect the actual data stream and the reference data stream to do the enrichment? The goal is to make sure all parallel operators have access to the whole reference data set and see the same state.
Is there any better way?
Flink's broadcast state pattern should work for you, as far as ensuring that all enrichment data is available to all sub-tasks.
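Here's a minimal sketch of the broadcast side, assuming hypothetical `Event`, `RefData`, and `EnrichedEvent` types (the accessors `getRefKey()` and `getKey()` are placeholders). It's a fragment from a job's `main()`, with `eventStream` and `refDataStream` already defined:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Describes the broadcast state: reference records keyed by a String id.
MapStateDescriptor<String, RefData> refStateDescriptor =
        new MapStateDescriptor<>(
                "referenceData",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(RefData.class));

// Broadcasting sends every reference record/update to every parallel sub-task.
BroadcastStream<RefData> refBroadcast = refDataStream.broadcast(refStateDescriptor);

DataStream<EnrichedEvent> enriched = eventStream
        .connect(refBroadcast)
        .process(new BroadcastProcessFunction<Event, RefData, EnrichedEvent>() {

            @Override
            public void processElement(Event event, ReadOnlyContext ctx,
                                       Collector<EnrichedEvent> out) throws Exception {
                // Read-only lookup against this sub-task's local copy of the broadcast state.
                RefData ref = ctx.getBroadcastState(refStateDescriptor).get(event.getRefKey());
                out.collect(new EnrichedEvent(event, ref));
            }

            @Override
            public void processBroadcastElement(RefData update, Context ctx,
                                                Collector<EnrichedEvent> out) throws Exception {
                // Every sub-task receives and applies the same update,
                // so all copies of the state stay identical.
                ctx.getBroadcastState(refStateDescriptor).put(update.getKey(), update);
            }
        });
```

Keep in mind that each parallel instance holds its own copy of the broadcast state, so memory use scales with parallelism; that's the trade-off for local lookups with no network calls during enrichment.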
For starting with all of the API-based data and then adding updates from Kafka, I would use Flink's hybrid source support. In your
`main()` method, download the enrichment data from the API and save it to files, then use a hybrid source that first reads from the file system and switches to Kafka once that's complete.
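A sketch of that wiring, assuming a recent Flink version (HybridSource was added in 1.14) with the file and Kafka connectors on the classpath; the snapshot path, broker address, and topic name are placeholders, and `env` is your `StreamExecutionEnvironment`:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;

// Bounded source over the files that main() wrote from the API download.
FileSource<String> fileSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("/tmp/ref-data-snapshot"))
        .build();

// Unbounded source for the live reference-data updates.
KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setTopics("ref-data-updates")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .setStartingOffsets(OffsetsInitializer.latest())
        .build();

// Reads the file snapshot to completion, then switches over to Kafka.
HybridSource<String> refSource = HybridSource.builder(fileSource)
        .addSource(kafkaSource)
        .build();

DataStream<String> refDataStream = env.fromSource(
        refSource, WatermarkStrategy.noWatermarks(), "reference-data");
```

The records come out as plain strings here; in practice you'd map both phases to your reference type before broadcasting, as above. Also, to avoid losing updates that arrive between the API download and the switch-over, consider capturing the Kafka offsets before you take the snapshot and starting the Kafka source from those offsets instead of from latest.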