Looking to implement a Dataflow (or Flink) streaming pipeline that reads from Pub/Sub, transforms the data to Parquet, and writes output every few minutes.
Does this require a fixed window? If so, do all the events in a window eventually get sent to a single worker (the fixed-window worker)?
If a key is added before the window, it will result in a shuffle. How can this be avoided?
i.e. I'm looking to implement something that can scale to N workers, each independently processing and writing out its own file (N files total), with no shuffle. It's a simple ETL flow.
You don't need to use a window. Just have a workflow with a source, followed by a Map (or FlatMap) function, and then your sink.
This works well if your job's parallelism matches the input parallelism (e.g. the number of partitions in Kafka); I'm not sure how that maps to the Pub/Sub source you're using. Something like the sketch below:
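Here's a minimal sketch of that source → map → sink layout, with no window. It assumes Flink's Google Cloud Pub/Sub connector (`PubSubSource` from `flink-connector-gcp-pubsub`) and the Parquet bulk writer from `flink-parquet` (`ParquetAvroWriters`); the `Event` class, project/subscription names, and output path are placeholders, and the parsing logic is elided.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.gcp.pubsub.PubSubSource;

public class PubSubToParquetJob {

    // Hypothetical POJO representing one parsed message.
    public static class Event {
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bulk formats (like Parquet) roll their part files on every checkpoint,
        // so the checkpoint interval controls how often output is committed.
        env.enableCheckpointing(2 * 60 * 1000); // every 2 minutes

        // Source: Google Cloud Pub/Sub connector (assumed builder API).
        DataStream<String> raw = env.addSource(
                PubSubSource.newBuilder()
                        .withDeserializationSchema(new SimpleStringSchema())
                        .withProjectName("my-project")           // assumed project id
                        .withSubscriptionName("my-subscription") // assumed subscription
                        .build());

        // Map step: parse each raw message into an Event (parsing elided).
        DataStream<Event> events = raw.map(PubSubToParquetJob::parse);

        // Sink: Parquet files written via the FileSink bulk format.
        FileSink<Event> sink = FileSink
                .forBulkFormat(new Path("gs://my-bucket/output"),
                        ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

        events.sinkTo(sink);

        env.execute("pubsub-to-parquet");
    }

    private static Event parse(String json) {
        Event e = new Event();
        e.payload = json; // placeholder parsing
        return e;
    }
}
```

As long as no key or window is introduced, each parallel subtask runs its own source → map → sink chain and writes its own part files, with no shuffle in between.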
If you need to generate N output files where N doesn't match your source parallelism, then some kind of shuffle is unavoidable. You can set the parallelism of your sink to whatever you want, and Flink will automatically insert a rebalance (shuffle) to handle the differing source vs. sink parallelism; see the snippet below.
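Continuing the sketch above (reusing the hypothetical `events` stream and `sink`), pinning the sink's parallelism is a one-liner; the value 8 here is just an example N:

```java
// The source keeps its own parallelism; the sink is pinned to 8 subtasks,
// so Flink inserts a rebalance between the map and the sink automatically.
// Each sink subtask writes its own part files, giving N files per roll interval.
events
        .sinkTo(sink)
        .setParallelism(8);

// Equivalently, you can make the shuffle explicit:
events
        .rebalance()
        .sinkTo(sink)
        .setParallelism(8);
```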