nomad fails to release CSI volume during "restart -reschedule", which would move allocations to a new host


Context:

  • Have a nomad job whose tasks depend on CSI (AWS EBS) volumes, spread across three host machines.
  • The allocations start and the service works. The volumes mount and data is stored on them.
  • nomad stop|start|restart all work. These commands (usually) restart the allocations on the same host machine.
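
For reference, the in-place commands above look roughly like this (the job name example-job is only a placeholder for illustration):

    nomad job stop example-job          # stop the job; allocations terminate
    nomad job run example-job.nomad     # start it again; allocations usually land on the same hosts
    nomad job restart example-job       # restart allocations in place (Nomad 1.5+)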

Problem:

  • When nomad restart -reschedule is run and a new nomad host machine is available, nomad fails to release the CSI mount after an individual allocation has stopped.

From what I can tell, nomad doesn't even try to release the volume. There's no "failed to release" message in any log file (nomad server, nomad client, ebs controller, ebs node).
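
One way to confirm that (a sketch only; the volume ID ebs-vol0 and the systemd unit name nomad are assumptions about the setup) is to watch the volume's claims and the client logs while the reschedule happens:

    nomad volume status ebs-vol0                    # lists the allocations still claiming the volume
    journalctl -u nomad --since "10 minutes ago" | grep -i -e csi -e volume
                                                    # no unpublish/release attempt shows up here either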

The first error I see anywhere is this:

[ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"

which occurs on the new node as it attempts to mount the volume.

At this point the previous allocation is dead/stopped, but the volume is still mounted on the previous host, and the volume is marked as unavailable.
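
That state can be confirmed from both sides (sketch only; vol-0123456789abcdef0 and ebs-vol0 are placeholder IDs):

    nomad volume status ebs-vol0                    # the claim still references the dead allocation
    aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
        --query 'Volumes[].Attachments'             # EBS still reports the volume attached to the old instance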

1 answer:
  • First, upgrade both Nomad and the AWS EBS CSI plugin to their latest versions. Newer versions often fix compatibility issues and bugs like this one (see the version-check sketch after this list).
  • Then, identify the stuck volume: find the allocation ID and volume ID of the problematic volume.
  • Detach the volume manually: use the AWS CLI or EBS API to detach the volume from the old host. Make sure the previous allocation really is stopped before proceeding.
  • Clear the "volume max claims reached" error: delete the stale volume claim left by the old allocation. This removes the claim and lets Nomad mount the volume on the new host again (see the manual-cleanup sketch after this list).
  • Lastly, consider using Nomad's -force flag with restart -reschedule to try forcing the release of volumes. However, use this cautiously, as it might lead to data loss if not handled carefully.
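
A quick way to check what is currently deployed (sketch only; the plugin ID aws-ebs0 is an assumption, yours may be named differently):

    nomad version                        # Nomad CLI/agent version
    nomad plugin status -type=csi        # list the registered CSI plugins
    nomad plugin status aws-ebs0         # plugin version plus controller/node health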
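
And a sketch of the manual cleanup described above (all IDs are placeholders, and this is one possible sequence rather than the only way; double-check that the old allocation is dead before detaching anything):

    nomad volume status ebs-vol0                               # find the node and allocation still claiming the volume
    nomad alloc status 4f5e6d7c                                # confirm the old allocation really is stopped
    aws ec2 detach-volume --volume-id vol-0123456789abcdef0    # detach the EBS volume from the old EC2 instance
    nomad volume detach ebs-vol0 <old-node-id>                 # ask Nomad to drop the stale claim for that node
    nomad system gc                                            # optionally force a GC pass to clean up stale claims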