How to reduce the number of checkpoint files written by Spark streaming

1.3k Views Asked by Warren Zhu At 08 February 2022 at 01:06

If a Spark streaming job involves shuffle and stateful processing, it can easily generate lots of small files per micro batch. We should decrease the number of files without hurting latency.


There is 1 best solution below

Warren Zhu On 08 February 2022 at 01:12

With all default configs, one Spark streaming micro batch will generate 80K files. This causes high QPS and latency on HDFS. It is better to change the configs below to reduce the number of checkpoint files.

| Config | Default | Suggested |
| --- | --- | --- |
| `spark.sql.streaming.minBatchesToRetain` | 100 | 30 |
| `spark.sql.streaming.stateStore.minDeltasForSnapshot` | 10 | 5 |
| `spark.sql.shuffle.partitions` | 200 | Depends on micro batch size, 50 or 100 |
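As a rough sketch of how the suggested values could be applied, the snippet below turns them into `--conf` arguments for `spark-submit`. The helper name `to_spark_submit_args` and the choice of 50 shuffle partitions are illustrative assumptions, not part of the original answer; pick the partition count that fits your micro batch size.

```python
# Suggested values from the table above. 50 shuffle partitions is only one of
# the two suggested options (50 or 100); tune it to the micro batch size.
SUGGESTED_CONFS = {
    "spark.sql.streaming.minBatchesToRetain": "30",
    "spark.sql.streaming.stateStore.minDeltasForSnapshot": "5",
    "spark.sql.shuffle.partitions": "50",
}


def to_spark_submit_args(confs):
    """Turn a dict of Spark settings into `--conf key=value` arguments.

    Hypothetical helper for illustration; it only builds strings and does
    not talk to Spark.
    """
    args = []
    for key, value in confs.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Prints the flags to paste after `spark-submit`.
    print(" ".join(to_spark_submit_args(SUGGESTED_CONFS)))
```

The same key/value pairs can instead be set in code through `SparkSession.builder.config(key, value)` before the streaming query starts.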

So, total number of files = `minBatchesToRetain` * 4 (2 for the left side + 2 for the right side) * shuffle partitions * number of stateful operators (each join or aggregation)

If all configs are left at their defaults, this is 100 * 4 * 200 * 1 = 80K files.
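The formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic from the answer; the function and parameter names are made up for illustration, and the factor of 4 is taken directly from the formula.

```python
def estimate_checkpoint_files(
    min_batches_to_retain=100,   # spark.sql.streaming.minBatchesToRetain
    files_per_partition=4,       # the "left 2 + right 2" factor from the formula
    shuffle_partitions=200,      # spark.sql.shuffle.partitions
    stateful_operators=1,        # each join or aggregation counts as one
):
    """Estimate the number of checkpoint files using the answer's formula."""
    return (
        min_batches_to_retain
        * files_per_partition
        * shuffle_partitions
        * stateful_operators
    )


if __name__ == "__main__":
    # All defaults: 100 * 4 * 200 * 1
    print(estimate_checkpoint_files())  # 80000
    # Suggested values with 50 shuffle partitions: 30 * 4 * 50 * 1
    print(estimate_checkpoint_files(min_batches_to_retain=30, shuffle_partitions=50))  # 6000
```

With the suggested settings and 50 shuffle partitions, the estimate drops from 80K to 6K files for a single stateful operator.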
