I have a MapReduce job which I run using job.waitForCompletion(true). If one or more reducer tasks get killed or crash during the execution of the job, the entire MapReduce job is restarted and the mappers and reducers are executed again (documentation). Here are my questions:
1] Can we know at the start of the job whether the job has started fresh or has been restarted because of some failure in a previous run? (This led me to Q2)
2] Can counters help? Do counter values get carried over if some tasks fail, leading to a restart of the whole job?
3] Is there any built-in checkpointing mechanism provided by Hadoop that keeps track of previous computation and avoids redoing the work that the mappers and reducers completed before failing/crashing?
Sorry if the questions are phrased unclearly. Thanks for the help.
A correction to the terminology first: a job does not restart if one or more of its tasks fail; only the failed task may get restarted. From a mapper/reducer context you can call getTaskAttemptID() (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html#getTaskAttemptID()), which returns a TaskAttemptID whose last token is the attempt number.
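As a minimal sketch of that (the class name and key/value types here are just placeholders for the example), a mapper can check in setup() whether the current attempt is a retry; the same call works from a reducer context:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Detects in setup() whether this task attempt is a retry of an earlier attempt.
public class AttemptAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void setup(Context context) {
        // The attempt id looks like attempt_1432345678901_0001_m_000005_1;
        // getId() returns the trailing attempt number (0 for the first attempt).
        int attempt = context.getTaskAttemptID().getId();
        if (attempt > 0) {
            System.err.println("Task " + context.getTaskAttemptID().getTaskID()
                    + " is being re-executed, attempt #" + attempt);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, ONE);
    }
}
```

Note that this only tells you whether the current task is a retry, not whether the job as a whole is a rerun.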
Counter updates from failed task attempts are not aggregated into the job totals, so there is no risk of overcounting.
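For illustration, here is how counters are typically used (the counter name and classes below are made up for the example): a task increments a custom counter through its context, and the driver reads the aggregated total after job.waitForCompletion(true); only increments from successful attempts contribute to that total.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterExample {

    // Custom counter; the framework aggregates it only from successful attempts.
    public enum MyCounters { PARSED_RECORDS }

    public static class CountingMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.getCounter(MyCounters.PARSED_RECORDS).increment(1);
            context.write(value, new LongWritable(1));
        }
    }

    // Driver side: read the job-level total after job.waitForCompletion(true).
    static long parsedRecords(Job job) throws IOException {
        return job.getCounters().findCounter(MyCounters.PARSED_RECORDS).getValue();
    }
}
```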
Generally not. The output of a failed task is cleared by the framework. If you are afraid of losing something that is expensive to compute because of a task failure, I would recommend splitting your job into multiple map/reduce phases, as sketched below. You could also maintain your own mutable distributed cache, but that is not recommended.
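A rough sketch of that split (the job names, argument paths, and the placeholder mapper/reducer classes in the comments are assumptions for the example): the expensive phase persists its output to HDFS, so a failure in the second job only requires rerunning the second job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // expensive results persisted here
        Path output = new Path(args[2]);

        // Phase 1: the expensive computation; its output stays on HDFS afterwards.
        Job phase1 = Job.getInstance(conf, "expensive-phase");
        phase1.setJarByClass(TwoPhaseDriver.class);
        // phase1.setMapperClass(...);  // your expensive mapper
        // phase1.setReducerClass(...); // your expensive reducer
        FileInputFormat.addInputPath(phase1, input);
        FileOutputFormat.setOutputPath(phase1, intermediate);
        if (!phase1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Phase 2: reads the persisted intermediate data. If this job fails,
        // only phase 2 needs to be rerun; phase 1's results are not recomputed.
        Job phase2 = Job.getInstance(conf, "cheap-phase");
        phase2.setJarByClass(TwoPhaseDriver.class);
        // phase2.setMapperClass(...);  // the rest of the pipeline
        // phase2.setReducerClass(...);
        FileInputFormat.addInputPath(phase2, intermediate);
        FileOutputFormat.setOutputPath(phase2, output);
        System.exit(phase2.waitForCompletion(true) ? 0 : 1);
    }
}
```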