Problem with nomad job deployment (raw_exec mode, v1.0.1)


A recent update from Nomad v0.9.6 to v1.0.1 breaks a job deployment. Unfortunately I couldn't get any usable information out of the Nomad agent about the "pending" (or "dead") status. I also checked the trace monitor in the web UI, but without success.

Could you please give some advice on how to get the reject/pending reason from the agent?
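These are the kinds of commands I have already tried (the evaluation ID below is a placeholder); none of them surfaced a reason beyond the "pending" status:

nomad job status -evals lightningCollector-lightningCollector
nomad eval status -verbose <eval-id>
nomad monitor -log-level=debug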

I use "raw_exec" driver (non-privileged user, driver.raw_exec.enable" = "1") F or deployment I use nomad-sdk (version 0.11.3.0)

You can find the job definition (from Nomad's point of view) here:

https://pastebin.com/ZXiaM9RW

OS details:

cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core) 
Linux blade1.lab.bulb.hr 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Nomad agent details:

[root@blade1 ~]# nomad node-status
ID        DC   Name                Class   Drain  Eligibility  Status
5838e8b0  dc1  blade1.lab.bulb.hr  <none>  false  eligible     ready

Verbose output:

[root@blade1 ~]# nomad node-status -verbose
ID                                    DC   Name                Class   Address         Version  Drain  Eligibility  Status
5838e8b0-ebd3-5c47-a949-df3d601e0da1  dc1  blade1.lab.bulb.hr  <none>  192.168.112.31  1.0.1    false  eligible     ready
[root@blade1 ~]# nomad node-status -verbose 5838e8b0-ebd3-5c47-a949-df3d601e0da1
ID              = 5838e8b0-ebd3-5c47-a949-df3d601e0da1
Name            = blade1.lab.bulb.hr
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 1516h1m31s

Drivers
Driver    Detected  Healthy  Message                             Time
docker    false     false    Failed to connect to docker daemon  2020-12-18T14:37:09+01:00
exec      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
java      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
qemu      false     false    <none>                              2020-12-18T14:37:09+01:00
raw_exec  true      true     Healthy                             2020-12-18T14:37:09+01:00

Node Events
Time                       Subsystem  Message          Details
2020-12-18T14:37:09+01:00  Cluster    Node registered  <none>

Allocated Resources
CPU          Memory      Disk
0/18000 MHz  0 B/53 GiB  0 B/70 GiB

Allocation Resource Utilization
CPU          Memory
0/18000 MHz  0 B/53 GiB

Host Resource Utilization
CPU            Memory         Disk
499/20000 MHz  33 GiB/63 GiB  (/dev/mapper/vg00-root)

Allocations
No allocations placed

Attributes
consul.datacenter         = dacs
consul.revision           = 1e03567d3
consul.server             = true
consul.version            = 1.8.5
cpu.arch                  = amd64
driver.raw_exec           = 1
kernel.name               = linux
kernel.version            = 3.10.0-693.21.1.el7.x86_64
memory.totalbytes         = 67374776320
nomad.advertise.address   = 192.168.112.31:5656
nomad.revision            = c9c68aa55a7275f22d2338f2df53e67ebfcb9238
nomad.version             = 1.0.1
os.name                   = centos
os.signals                = SIGTTIN,SIGUSR2,SIGXCPU,SIGBUS,SIGILL,SIGQUIT,SIGCHLD,SIGIOT,SIGKILL,SIGINT,SIGSTOP,SIGSYS,SIGTTOU,SIGFPE,SIGSEGV,SIGTSTP,SIGURG,SIGWINCH,SIGCONT,SIGIO,SIGTRAP,SIGXFSZ,SIGHUP,SIGPIPE,SIGTERM,SIGPROF,SIGABRT,SIGALRM,SIGUSR1
os.version                = 7.4.1708
unique.cgroup.mountpoint  = /sys/fs/cgroup/systemd
unique.consul.name        = grabber1
unique.hostname           = blade1.lab.bulb.hr
unique.network.ip-address = 192.168.112.31
unique.storage.bytesfree  = 74604830720
unique.storage.bytestotal = 126698909696
unique.storage.volume     = /dev/mapper/vg00-root

Meta
connect.gateway_image     = envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level         = info
connect.proxy_concurrency = 1
connect.sidecar_image     = envoyproxy/envoy:v${NOMAD_envoy_version}

Job status details

[root@blade1 ~]# nomad status
ID                                     Type     Priority  Status   Submit Date
lightningCollector-lightningCollector  service  50        pending  2020-12-18T15:06:09+01:00


[root@blade1 ~]# nomad status lightningCollector-lightningCollector
ID            = lightningCollector-lightningCollector
Name          = lightningCollector-lightningCollector
Submit Date   = 2020-12-18T15:06:09+01:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
lightningCollector-lightningCollector-0  0       0         0        0       0         0

Allocations
No allocations placed

Thank you for your effort and time! Regards, Ivan

Best Answer

I tested your job locally and was able to reproduce the behavior you describe. I noticed that ParentID was set in the job; Nomad uses that field to track child instances of periodic or dispatch jobs.
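With the job registered, the non-empty field is visible in the job's JSON, e.g. (the value shown here is illustrative):

nomad job inspect lightningCollector-lightningCollector | grep ParentID
    "ParentID": "lightningCollector",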

After setting the ParentID value to "" (an empty string), I was able to submit the job, and it evaluated and scheduled properly.
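In nomad-java-sdk terms, that amounts to clearing the field on the Job object before registering it. A minimal sketch (class and method names are from com.hashicorp.nomad:nomad-sdk; the job-building helper below is a hypothetical stand-in for however your job is actually assembled):

import com.hashicorp.nomad.apimodel.Job;
import com.hashicorp.nomad.javasdk.NomadApiClient;
import com.hashicorp.nomad.javasdk.NomadApiConfiguration;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        NomadApiConfiguration config = new NomadApiConfiguration.Builder()
                .setAddress("http://192.168.112.31:4646") // agent HTTP address (placeholder)
                .build();

        try (NomadApiClient client = new NomadApiClient(config)) {
            Job job = buildJob(); // stand-in for your existing job-building code

            // ParentID is reserved for children of periodic/dispatch jobs;
            // a regular service job must leave it empty, or it stays pending.
            job.setParentId("");

            client.getJobsApi().register(job);
        }
    }

    // Hypothetical helper; replace with your real job definition.
    private static Job buildJob() {
        return new Job()
                .setId("lightningCollector-lightningCollector")
                .setName("lightningCollector-lightningCollector")
                .setType("service")
                .addDatacenters("dc1");
    }
}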

I did some testing across versions and determined that the behavior changed between 0.12.0 and 0.12.1. I filed hashicorp/nomad #10422 in response to this difference in behavior.