Looking to implement a Dataflow (or Flink) streaming pipeline that reads from Pub/Sub, transforms the data to Parquet, and writes output every few minutes.
Does this require a fixed window? If so, do all the events in a window eventually get sent to a single worker (the fixed-window worker)?
If a key is added before the window, it will result in a shuffle. How can this be avoided?
i.e. I'm looking to implement something that can scale to N workers, each independently processing and writing out its own file (N files total), with no shuffle. It's a simple ETL flow.
You don't need to use a window. Just have a workflow with a source, followed by a Map (or FlatMap) function, and then your sink.
This works well if your job's parallelism matches the input parallelism (e.g. the number of partitions in Kafka); I'm not sure how that maps to the Pub/Sub source you're using. Something like the sketch below:
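Here's a minimal sketch of that source → map → sink layout, with no window. It assumes Flink's Google Cloud Pub/Sub connector (`PubSubSource` from `flink-connector-gcp-pubsub`) and the Parquet bulk writer from `flink-parquet` (`ParquetAvroWriters`); the `Event` class, project/subscription names, and output path are placeholders, and the parsing logic is elided.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.gcp.pubsub.PubSubSource;

public class PubSubToParquetJob {

    // Hypothetical POJO representing one parsed message.
    public static class Event {
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bulk formats (like Parquet) roll their part files on every checkpoint,
        // so the checkpoint interval controls how often output is committed.
        env.enableCheckpointing(2 * 60 * 1000); // every 2 minutes

        // Source: Google Cloud Pub/Sub connector (assumed builder API).
        DataStream<String> raw = env.addSource(
                PubSubSource.newBuilder()
                        .withDeserializationSchema(new SimpleStringSchema())
                        .withProjectName("my-project")           // assumed project id
                        .withSubscriptionName("my-subscription") // assumed subscription
                        .build());

        // Map step: parse each raw message into an Event (parsing elided).
        DataStream<Event> events = raw.map(PubSubToParquetJob::parse);

        // Sink: Parquet files written via the FileSink bulk format.
        FileSink<Event> sink = FileSink
                .forBulkFormat(new Path("gs://my-bucket/output"),
                        ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

        events.sinkTo(sink);

        env.execute("pubsub-to-parquet");
    }

    private static Event parse(String json) {
        Event e = new Event();
        e.payload = json; // placeholder parsing
        return e;
    }
}
```

As long as no key or window is introduced, each parallel subtask runs its own source → map → sink chain and writes its own part files, with no shuffle in between.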
If you need to generate N output files where N doesn't match your source parallelism, then some kind of shuffle is unavoidable. You can set the parallelism of your sink to whatever you want, and Flink will automatically insert a rebalance (shuffle) to handle the differing source vs. sink parallelism; see the snippet below.
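Continuing the sketch above (reusing the hypothetical `events` stream and `sink`), pinning the sink's parallelism is a one-liner; the value 8 here is just an example N:

```java
// The source keeps its own parallelism; the sink is pinned to 8 subtasks,
// so Flink inserts a rebalance between the map and the sink automatically.
// Each sink subtask writes its own part files, giving N files per roll interval.
events
        .sinkTo(sink)
        .setParallelism(8);

// Equivalently, you can make the shuffle explicit:
events
        .rebalance()
        .sinkTo(sink)
        .setParallelism(8);
```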