gearman and retrying workers with unreliable external dependencies

I'm using gearman to queue a variety of different jobs, some of which can always be serviced immediately, and some of which can "fail" because they require an unreliable external service. (For example, sending email might require an SMTP server that's frequently unavailable.)

If an external service goes down, I'd like to keep all jobs which require that service on the queue, and retry one job occasionally (every few minutes, say) until the service becomes available again. (Perhaps optionally sending email if the service has not been available for hours.)

However I'd like jobs that don't require a failed service to be passed on to workers as soon as possible. How can this be achieved? (I'm happy to put some of the logic in the workers if necessary, although it seems to be a bit "late" to throttle on the worker side.)

There are 1 best solutions below

Gearman should already be able to handle this, as long as you have some workers that specialise in handling jobs with unreliable dependencies and don't handle other jobs, along with some workers that handle either all jobs or just the jobs without unreliable dependencies.

All you would need to do is add some code to the unreliable-dependency workers so that they only accept jobs once they have checked that the dependent service is running. If the service is down, just have them wait a bit and retest the service (and continue ad infinitum). Once the service is up, have them join the gearmand server, do a job, return the work, retest the service, and so on.
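Roughly, the check-wait-retest loop could look like the sketch below. This is only an illustration under my own assumptions, not code from my system: `service_is_up` does a plain TCP connect (which is a weak check for something like SMTP), and `take_one_job` is a hypothetical stand-in for whatever call in your gearman client library grabs and runs a single job.

```python
import socket
import time


def service_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (a crude probe)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(check, interval=120.0, sleep=time.sleep):
    """Block until check() returns True, sleeping `interval` seconds between tries.

    Returns the number of failed checks seen before the service came up.
    """
    failures = 0
    while not check():
        failures += 1
        sleep(interval)
    return failures


def run_dependent_worker(check, take_one_job, interval=120.0,
                         max_jobs=None, sleep=time.sleep):
    """Retest the service before every job, and only then ask for work.

    `take_one_job` is a hypothetical callable, not a real gearman API.
    `max_jobs=None` means loop forever.
    """
    done = 0
    while max_jobs is None or done < max_jobs:
        wait_until_up(check, interval, sleep)
        take_one_job()
        done += 1
    return done
```

You would call it with something like `run_dependent_worker(lambda: service_is_up("smtp.example.com", 25), take_one_job)`, where the host name is made up. Counting failures in `wait_until_up` also gives you a hook for the "send an alert if it has been down for hours" idea.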

While the dependent service is down, the workers that don't handle jobs that need the service will keep on trundling through the job queue for the other jobs. Gearmand won't block an entire job queue (or worker) on one job type if there are workers available to handle other job types.

The key is to be sensible about how you define your job types and workers.

EDIT--

Ah-ha, I knew my thinking was a little off (I wrote my gearman system about a year ago and haven't really touched it since). My solution to this type of issue was to have all the workers that normally handle the dependent job unregister that capability with the gearmand server once a failure was detected in the dependent service (and any workers that are in the middle of that job should return a failure). Once the service is back up, get those same workers to re-register their ability to handle that job. Do note that this requires another channel of communication so the workers can be notified of the status of the dependent services.
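As a rough sketch of the unregister/re-register idea (again, an illustration and not my original code): the toggle below works with any worker object that has `register_task(name, handler)` and `unregister_task(name)` methods. I believe those names match python-gearman's `GearmanWorker`, but treat that as an assumption and check your own library. `RecordingWorker` is a made-up stand-in so the example runs without a gearmand server, and how `service_up` reaches the worker (the "other channel") is left out.

```python
class CapabilityToggle:
    """Register or unregister one job type as its dependency goes up or down."""

    def __init__(self, worker, task_name, handler):
        self.worker = worker
        self.task_name = task_name
        self.handler = handler
        self.registered = False

    def update(self, service_up):
        """Bring registration in line with the service state; return the new state."""
        if service_up and not self.registered:
            self.worker.register_task(self.task_name, self.handler)
            self.registered = True
        elif not service_up and self.registered:
            self.worker.unregister_task(self.task_name)
            self.registered = False
        return self.registered


class RecordingWorker:
    """Hypothetical stand-in for a gearman worker; it only records the calls."""

    def __init__(self):
        self.calls = []

    def register_task(self, name, handler):
        self.calls.append(("register", name))

    def unregister_task(self, name):
        self.calls.append(("unregister", name))


if __name__ == "__main__":
    worker = RecordingWorker()
    toggle = CapabilityToggle(worker, "send_email", lambda job: None)
    for state in (True, False, False, True):
        toggle.update(state)
    print(worker.calls)
```

Because `update` only acts on a change of state, you can call it on every status notification (or every poll) without hammering gearmand with repeated registrations.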

Hope this helps