I am deploying some Flink jobs which require access to some services under a service mesh implemented via Linkerd and I'm running into this error:
java.lang.NoClassDefFoundError: Could not initialize class foo.bar.Job
I can confirm that the jar file contains the class that supposedly cannot be found, so the jar itself is not the problem; the error seems to be related to Linkerd. In particular, I'm using the following pod annotations for both the jobmanager and the taskmanager pods (taken from my Helm chart values file):
podAnnotations:
  linkerd.io/inject: enabled
  config.linkerd.io/skip-outbound-ports: 6123,6124
  config.linkerd.io/proxy-await: enabled
For what it's worth, I'm using the Ververica Platform (Community Edition) for deploying my jobs to Kubernetes, although I don't think the issue is VVP-specific:
{{- define "vvp.deployment" }}
kind: Deployment
apiVersion: v1
metadata:
  name: my-job
spec:
  template:
    spec:
      artifact:
        kind: jar
        flinkImageRegistry: {{ .Values.flink.imageRegistry }}
        flinkVersion: "1.15.1"
        flinkImageTag: 1.15.1-stream1-scala_2.12-java11-linkerd
        entryClass: foo.bar.Job
      kubernetes:
        jobManagerPodTemplate:
          metadata:
            {{- with .Values.flink.podAnnotations }}
            annotations:
              {{- toYaml . | nindent 14 }}
            {{- end }}
          spec:
            containers:
              - name: flink-jobmanager
                command:
                  - linkerd-entrypoint.sh
        taskManagerPodTemplate:
          metadata:
            {{- with .Values.flink.podAnnotations }}
            annotations:
              {{- toYaml . | nindent 14 }}
            {{- end }}
{{- end }}
where the contents of linkerd-entrypoint.sh are:
#!/bin/bash
set -e
exec linkerd-await --shutdown -- "$@"
For extra context, the VVP and the Flink jobs are deployed into different namespaces. Also, I'm not using any Linkerd annotations whatsoever on the VVP pods.
Has anyone encountered similar problems? The closest troubleshooting resource/guide that I've found so far is this one, which targets Istio instead of Linkerd.
Answering my own question after having determined the root cause of the issue.
Regarding Linkerd, everything was set up correctly. The main precaution is to add the linkerd-await binary to the Flink image and to override the entrypoint for the jobmanager, since otherwise you will run into issues when upgrading your jobs. The jobmanager won't kill the Linkerd proxy, and because of that the pod will hang around with NotReady status. Again, that is easily solved by wrapping the main command in a linkerd-await call.
So, first add the linkerd-await binary to your Docker image:
Then, for the jobmanager only, override the entrypoint like this:
Alternatively, one could use the LINKERD_DISABLED or LINKERD_AWAIT_DISABLED env vars to bypass the linkerd-await wrapper. For more info on using jobs with Linkerd, consult the following resources:
Also, regarding the config.linkerd.io/proxy-await annotation: it only does the waiting, not the shutdown part, so if we are going to run linkerd-await --shutdown -- "$@" manually anyway, that annotation is redundant and can be safely removed:
Finally, regarding the java.lang.NoClassDefFoundError from the question, let me clarify that this had nothing to do with Linkerd. This was mostly a config error along the lines of:
Essentially (the specific details are irrelevant), there were some env vars missing in the taskmanager pods. Note that the exception message says "Could not initialize class foo.bar.Job" which is different from "Could not find class...".
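To illustrate the distinction, here is a minimal sketch of how a missing env var can surface as "Could not initialize class" even though the class is present in the jar. It is not the actual job code: the nested Job class, the requireEnv helper, and the env var name DEMO_MISSING_ENV_VAR_FOR_SKETCH are all hypothetical, and the sketch assumes that variable is not set where you run it.

```java
public class InitFailureDemo {

    // Hypothetical stand-in for foo.bar.Job: its static initializer needs an env var.
    static class Job {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final String ENDPOINT = requireEnv("DEMO_MISSING_ENV_VAR_FOR_SKETCH");

        static String requireEnv(String name) {
            String value = System.getenv(name);
            if (value == null) {
                throw new IllegalStateException("Missing env var: " + name);
            }
            return value;
        }
    }

    // Touches the class and reports what the JVM throws, if anything.
    public static String touch() {
        try {
            return "ok: " + Job.ENDPOINT;
        } catch (Throwable t) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // First use: the static initializer fails, so the JVM throws
        // ExceptionInInitializerError wrapping the IllegalStateException.
        System.out.println(touch());
        // Any later use: the class is marked as failed, so the JVM throws
        // NoClassDefFoundError with a "Could not initialize class" message.
        System.out.println(touch());
    }
}
```

The first failure carries the real cause, so in a situation like this it is worth searching the taskmanager logs for an earlier ExceptionInInitializerError rather than inspecting the jar.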
Sorry for the confusion!