How can I recover jobs from a savepoint when running multiple jobs via executeAsync in application mode (Flink 1.18)?


I'm working with Flink's Java API on version 1.18 and want to use application mode to run 2 jobs in one pod (Kubernetes Docker deployment).

In the Java code, I use a for loop to create 2 or more jobs with env.executeAsync, creating a new environment in each iteration. This way we can run multiple jobs in parallel in one Docker pod to reduce resource cost.
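A minimal sketch of this pattern (assumptions: the `buildPipeline` helper and the `savepointPaths` array are hypothetical placeholders; `execution.savepoint.path` is the configuration key Flink reads to restore a job from a savepoint, so each environment can be handed its own restore path):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class MultiJobMain {
    public static void main(String[] args) throws Exception {
        // Hypothetical per-job restore paths; null means a fresh start.
        String[] savepointPaths = {null, null};
        List<JobClient> clients = new ArrayList<>();

        for (int i = 0; i < 2; i++) {
            Configuration conf = new Configuration();
            if (savepointPaths[i] != null) {
                // Hand each environment its own savepoint to restore from.
                conf.setString("execution.savepoint.path", savepointPaths[i]);
            }
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            buildPipeline(env, i); // hypothetical: wires sources/operators/sinks for job i
            // executeAsync submits the job and returns without blocking,
            // so the loop can submit the next job immediately.
            clients.add(env.executeAsync("job-" + i));
        }

        // Keep the job ids somewhere durable (log, ConfigMap, ...) so an
        // external scheduler can address each job's REST endpoints later.
        for (JobClient client : clients) {
            System.out.println(client.getJobID());
        }
    }

    private static void buildPipeline(StreamExecutionEnvironment env, int jobIndex) {
        // job-specific topology goes here
    }
}
```

Running this needs a Flink runtime on the classpath; the point is that the job IDs printed at the end are what the savepoint REST calls below need.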

In application mode, I think I cannot rely on checkpoint recovery, because HA cannot be enabled in this mode, so the previous job IDs cannot be stored in ZooKeeper to recover from a checkpoint. Ref: https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/overview/#application-mode

So I want to recover via savepoints when the Docker pod goes down or needs to restart. My questions are:

  • How can I trigger a savepoint for each job (I currently run 2 jobs in one pod) every hour?
  • How can I restore each job from its savepoint when the Docker pod restarts, via Java code or the REST API?

I hope savepoints or checkpoints can help when I run multiple jobs in application mode.


1 Answer

David Anderson

FWIW, for Flink, "high availability" refers to the ability of a Flink cluster to recover from a failed job manager. Even without so-called high availability, I believe that jobs running in application mode can still rely on checkpoints to recover from task manager failures. And you can manually restart from a retained checkpoint in the event of a job manager failure (in the same way you would use a savepoint).

To use savepoints for recovery, you can use Flink's REST API to trigger and restart from savepoints.
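A sketch of the second half of that flow, assuming the JobManager listens on localhost:8081: triggering via `POST /jobs/:jobid/savepoints` returns a trigger ID, and `GET /jobs/:jobid/savepoints/:triggerid` reports when the savepoint completed and where it was written. For application mode specifically, one common approach is to store that location and hand it back to the job at pod restart through its configuration, rather than via a REST call:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SavepointStatus {
    // Assumption: the JobManager's REST endpoint.
    static final String REST_BASE = "http://localhost:8081";

    // Polls the trigger id returned by POST /jobs/:jobid/savepoints.
    // When the response reports COMPLETED, it also carries the savepoint's
    // location, which is the path to restore from later.
    static HttpRequest statusRequest(String jobId, String triggerId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(REST_BASE + "/jobs/" + jobId + "/savepoints/" + triggerId))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        System.out.println(statusRequest("abc123", "trigger-1").uri());
        // After the pod restarts, hand each stored location back to its job,
        // e.g. by setting "execution.savepoint.path" in that job's
        // Configuration before calling executeAsync again.
    }
}
```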