Does RDD recomputation on task failure cause duplicate data processing?

When a task fails and its RDD partition is recomputed from lineage (perhaps by re-reading the input file), how does Spark ensure the data is not processed twice? What if the failed task had already written half of its data to an output such as HDFS or Kafka? Will that part of the data be written again? Is this related to exactly-once processing?
Answer
Recomputing an RDD from lineage does not by itself duplicate data inside Spark: the failed task's partition is simply recomputed, and only the successful attempt's result is used downstream. The risk is at the output boundary. Output operations in Spark Streaming have at-least-once semantics by default: the foreachRDD action can run more than once for the same batch if a worker fails, so the same data may be written to external storage multiple times. There are two common ways to deal with this, idempotent updates and transactional updates, both of which are covered in the article linked under Further reading.
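As a rough illustration of the idempotent-update approach, here is a minimal Spark Streaming sketch that keys every write on a deterministic record id, so a replayed task overwrites the same rows instead of appending duplicates. The JDBC URL, table name, and key derivation are hypothetical placeholders; the same idea applies to any sink that supports upserts.

    import java.sql.DriverManager

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object IdempotentOutputSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("idempotent-output").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Any DStream source works; a socket stream keeps the sketch small.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One connection per partition, opened on the executor.
            val conn = DriverManager.getConnection(
              "jdbc:postgresql://db:5432/demo", "user", "pass") // hypothetical sink
            val stmt = conn.prepareStatement(
              // Upsert keyed on a deterministic id: if the task is re-executed,
              // the replayed records overwrite the same rows instead of adding new ones.
              "INSERT INTO events (id, payload) VALUES (?, ?) " +
                "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
            records.foreach { line =>
              val id = line.hashCode.toString // stand-in for a real deterministic key
              stmt.setString(1, id)
              stmt.setString(2, line)
              stmt.executeUpdate()
            }
            stmt.close()
            conn.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

The transactional alternative stores the batch's results and a batch (or offset) identifier in the same database transaction, so a replayed batch can be detected and skipped; the linked article walks through both variants in more detail.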
Further reading
http://shzhangji.com/blog/2017/07/31/how-to-achieve-exactly-once-semantics-in-spark-streaming/