I have a setup where I run long, idempotent tasks on AWS spot instances, but I can't work out how to configure Celery to gracefully handle workers being killed mid-task.
At the moment, if a worker is killed, the task is marked as failed (WorkerLostError). The documentation on the subject is a bit lean, but it suggests using CELERY_ACKS_LATE for this scenario. That isn't working for me; the task is still marked as failed.
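Concretely, when a spot instance is reclaimed mid-task, checking the result from the master shows something like this (a rough illustration, where app is my Celery app instance and task_id is whatever id was dispatched):

result = app.AsyncResult(task_id)
result.state   # 'FAILURE'
result.result  # WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')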
When I had CELERY_ACKS_LATE = False the task just stayed stuck as PENDING, so at least I can now tell that it has failed, which is a good start.
Here are my config settings at the moment:
# I'm using RabbitMQ as the broker
BROKER_HEARTBEAT = 10
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_TRACK_STARTED = True
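For context, the tasks themselves look roughly like this (the name and body are placeholders; the point is that re-running one with the same arguments is safe):

from myproject.celery import app  # hypothetical module holding the Celery app

@app.task
def long_idempotent_task(job_id):
    # Long-running work; running it twice for the same job_id leaves
    # the same end state, so a retry after a killed worker is harmless.
    do_the_work(job_id)  # placeholder for the actual work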
I have a task spinning on a master server that checks the results of outstanding tasks, updates my local db to mark them as complete, and does further work with the results. At this stage I think I'm going to have to catch the 'Worker exited prematurely: signal 15 (SIGTERM)' failure there and retry the task myself, roughly as in the sketch below.
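Roughly what I have in mind for that master-side check (simplified; record_result and original_args are placeholders for my db helpers):

from billiard.exceptions import WorkerLostError

def check_outstanding(task_ids):
    for task_id in task_ids:
        result = app.AsyncResult(task_id)
        if not result.ready():
            continue  # still running or queued; check again next pass
        try:
            value = result.get(propagate=True)
            record_result(task_id, value)  # mark complete in the local db
        except WorkerLostError:
            # Worker was killed mid-task (e.g. the spot instance was reclaimed).
            # The task is idempotent, so resubmit it with its original arguments.
            args = original_args(task_id)  # look up what was originally dispatched
            long_idempotent_task.apply_async(args=args, task_id=task_id)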
It feels like this should all be handled by Celery itself, so I suspect I've missed something fundamental in my config.
Given idempotent tasks and workers that are expected to die, what is the best way to configure Celery so that those tasks get picked up by a different worker?