We are working on a mainframe modernization program in which we need to process 3-5 billion records per day, spread across 6 cycles within a 24-hour window.
The records are stored in files in a specific format and must be transformed so they are usable by downstream systems.
We definitely need distributed processing capability, and Hazelcast Jet seems like a promising technology. However, most of the available examples cover real-time stream processing.
What we want to know: can Hazelcast Jet be used for batch workloads at the volume we need to process?
Apache Spark is an alternative option; we are exploring technologies that offer on-demand scalability with low overhead.
Hazelcast Jet is appropriate for both batch and streaming workloads; if you dig into the Jet API a bit you'll see concepts such as 'BatchStage' and 'StreamStage', or 'BatchSource' and 'StreamSource'. Much of the API is identical, but in particular the initial ingestion stage needs to know whether it's reading from a batch or a stream source, and certain operations (like sorting) can't be applied to a potentially infinite streaming source.
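As a minimal sketch of what the batch side looks like (Jet 4.x API; the directory '/data/input' and the trivial processing steps are placeholder assumptions, not your actual record format):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.BatchStage;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class BatchJobSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Sources.files(...) is a BatchSource: Jet knows the input is finite,
        // so batch-only operations such as sort() are legal on this stage.
        BatchStage<String> lines = p.readFrom(Sources.files("/data/input"));

        lines.map(String::trim)
             .filter(line -> !line.isEmpty())
             .sort()                  // allowed on a BatchStage, not a StreamStage
             .writeTo(Sinks.logger());

        JetInstance jet = Jet.bootstrappedInstance();
        try {
            jet.newJob(p).join();     // a batch job completes; a streaming job would run indefinitely
        } finally {
            jet.shutdown();
        }
    }
}
```

The key point is the type of the stage: because 'readFrom' was given a batch source, you get a 'BatchStage' and the whole-dataset operations become available, while the rest of the map/filter-style API looks the same as the streaming variant.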
The Hazelcast Jet API is good at giving you opportunities to exploit concurrency, but if your input source is a single file that must be read sequentially, that read becomes the ceiling on your throughput. If you can read from multiple files, or run multiple readers per input file, that will boost your throughput.
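For example, if the mainframe extract can be split into many part files, Jet's file source will distribute different files across its parallel processors (and across cluster members), so read throughput scales with the number of files. A sketch under that assumption; the directory, the 'part-*.dat' naming scheme, and the parse step are all hypothetical:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class ParallelFileRead {
    public static void main(String[] args) {
        // Each part file can be read by a different processor, so the
        // single-reader bottleneck disappears when input is split up.
        BatchSource<String> parts = Sources.filesBuilder("/data/cycle-01")
                .glob("part-*.dat")      // hypothetical part-file naming
                .sharedFileSystem(true)  // true if all members see the same mount (e.g. NFS/SAN)
                .build();

        Pipeline p = Pipeline.create();
        p.readFrom(parts)
         .map(line -> line.trim())      // placeholder for your real record parsing
         .writeTo(Sinks.logger());

        JetInstance jet = Jet.bootstrappedInstance();
        try {
            jet.newJob(p).join();
        } finally {
            jet.shutdown();
        }
    }
}
```

Note the 'sharedFileSystem' flag: on shared storage it tells Jet to divide the files among members so each file is processed exactly once, rather than every member reading its own local copy.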