How to implement circular task list with Gearman?


I have a table in my MySQL database containing 200K records. Each record contains a URL that should be processed in some way. The URL processing in my case is not a trivial task, so I have chosen to use the Gearman queue to run these as background jobs.

So, for each record (URL) in my table I plan to create a separate task and submit it to Gearman.

Also, the data in my table is not static: new URLs are added very often.

According to my business logic I need to process this list of URLs continuously. When I have finished processing the last record in my DB table, I should move back to the first one and repeat the process for all the records again.

So my questions:

  • What is the best way to supply tasks to Gearman in this case?
  • Should I use cron, or is it possible to organize the logic so that Gearman pulls tasks automatically?
  • How many tasks can be submitted to Gearman at one time?

So, could you please tell me how best to implement this system?

BEST ANSWER

It sounds like what you need is a queue where processed items are added back to the end of the queue. I suggest organizing the workflow like this:

  1. Once a new URL appears in your system, add it to Gearman background job queue.

  2. In the Gearman worker implementation, once a job has been processed, submit the same URL to the queue again.

In this way, you will be constantly processing URLs in the order they were added to the queue and the whole queue will be infinitely reiterated. This assumes that you are repeatedly performing one task, of course.
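The two steps above can be sketched in Python. This is a minimal stand-in, not real Gearman code: a `collections.deque` plays the role of Gearman's job queue, and `process` is a hypothetical placeholder for your URL-processing logic.

```python
from collections import deque

def process(url):
    """Placeholder for the real URL-processing job."""
    return "processed " + url

def run_cycle(queue, rounds):
    """Pop the next URL, process it, then append it back to the end
    of the queue -- the same re-enqueue step the worker would do."""
    results = []
    for _ in range(rounds):
        url = queue.popleft()
        results.append(process(url))
        queue.append(url)  # re-enqueue: this is what makes the list circular
    return results

urls = deque(["http://a.example", "http://b.example", "http://c.example"])
print(run_cycle(urls, 4))  # wraps around to the first URL on the 4th round
```

With real Gearman, the re-enqueue step is simply the worker calling the client API to submit a new background job with the same URL as payload.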

If there's more than one task (e.g. first do task #1 on all URLs, then task #2, etc.), you can follow a similar pattern: after the first task, send the jobs to a second queue (i.e. a different worker). Then, depending on how exactly you want to order the work, either everything will happen automatically (if both workers are available all the time) or you will need to monitor queue #1 and only start worker #2 when it is empty. For the details of such monitoring, see Any way to access Gearman administration?
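To make the two-queue variant concrete, here is a minimal sketch (plain Python, no Gearman server): `task1` and `task2` are hypothetical stages, and two deques stand in for the two workers' queues.

```python
from collections import deque

# Hypothetical stages; with Gearman these would be two registered
# functions, each served by its own worker.
def task1(url):
    return url + "|t1"

def task2(payload):
    return payload + "|t2"

queue1 = deque(["http://a.example", "http://b.example"])
queue2 = deque()

# Worker #1: process each job, then hand the result to queue #2.
while queue1:
    queue2.append(task1(queue1.popleft()))

# Worker #2 starts only after queue #1 has drained.
results = [task2(p) for p in queue2]
print(results)
```

Here the ordering is enforced trivially because everything runs in one process; with real workers you would get it either from both workers running continuously, or from the queue monitoring described above.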

In general, Gearman could easily and quickly handle 200,000 items. Now, using a persistent queue will slow things down a bit (it's essentially a MySQL / other DB connection), but should not do anything horrible. I have not tried it myself, but the success stories usually involve even more items and often a persistent queue, too.

The only thing you need to be aware of is that Gearman does not allow processing jobs in batches (e.g. 10 items concurrently). As you are processing URLs, this means you will need to process 1 URL at a time, which is costly, because you will have to wait for each of them to be downloaded separately. You can avoid this either by using an event-driven / non-blocking approach for the processing, or by taking a look at beanstalkd, which allows such batch processing.
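One way around the one-URL-at-a-time cost is to make each job carry a batch of URLs and download them concurrently inside the worker. A sketch under that assumption, with `fetch` as a stand-in for a real HTTP download:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real HTTP download; an actual worker would use
    an HTTP client here and return the body or status code."""
    return (url, 200)

urls = ["http://example.com/page%d" % i for i in range(10)]

# Download the whole batch concurrently instead of one at a time;
# pool.map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))

print(len(results))
```

The same idea applies to an event-driven approach: the point is that the worker overlaps the network waits, rather than paying for each download sequentially.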