Large memory Python background jobs


I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.

I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.

As I understand it, python-rq uses pickle to serialise the job to be executed (a reference to the function together with its parameters) and stores this, along with other job metadata, in a Redis hash.
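
For context, here is a minimal sketch of the enqueue call this implies; tasks.load_into_mongo and records below are hypothetical stand-ins for the real job function and payload:

    from redis import Redis
    from rq import Queue

    q = Queue(connection=Redis())

    # Hypothetical stand-in for the ~50MB payload described above.
    records = [{"key": i, "value": "..."} for i in range(1000)]

    # rq pickles the call (function path plus arguments) and writes it, together
    # with other job metadata, into a Redis hash, so the whole records payload
    # sits in Redis until a worker picks the job up.
    job = q.enqueue("tasks.load_into_mongo", records)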

Since the parameters contain the information to be saved to the database, they are quite large (~50MB), and serialising them and saving them to Redis not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 per month for just 100MB. In fact, I very often get OOM errors like:

OOM command not allowed when used memory > 'maxmemory'.

I have two questions:

  1. Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
  2. Is there a way to serialise a reference to the parameter rather than the parameter itself?

Your thoughts on the best solution are much appreciated!

2 Answers

Accepted Answer

Since you mentioned in your comment that your task input is a large list of key/value pairs, I'm going to recommend the following (a sketch of the producer side follows this list):

  • Load up your list of key/value pairs in a file.
  • Upload the file to Amazon S3.
  • Get the resulting file URL, and pass that into your RQ task.
  • In your worker task, download the file.
  • Parse the file line-by-line, inserting the documents into Mongo.
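
Here is a rough sketch of what the producer side could look like, assuming boto3 for S3 access; the bucket name and the tasks.import_chunk worker path are purely illustrative, and each line of the uploaded file is a JSON-encoded [key, value] pair:

    import gzip
    import json
    import uuid

    import boto3
    from redis import Redis
    from rq import Queue

    BUCKET = "my-import-bucket"  # hypothetical bucket name
    s3 = boto3.client("s3")
    q = Queue(connection=Redis())

    def enqueue_import(pairs, chunk_size=50_000):
        """Split the key/value pairs into chunks, gzip each chunk, upload it
        to S3, and enqueue only the resulting S3 key, not the data itself."""
        for start in range(0, len(pairs), chunk_size):
            chunk = pairs[start:start + chunk_size]
            lines = "\n".join(json.dumps([k, v]) for k, v in chunk)
            key = "imports/%s.jsonl.gz" % uuid.uuid4()
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body=gzip.compress(lines.encode("utf-8")))
            # Only the short bucket/key reference travels through Redis.
            q.enqueue("tasks.import_chunk", BUCKET, key)

With this approach each enqueued job carries a few dozen bytes of reference instead of megabytes of data.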

Using the method above, you'll be able to:

  • Quickly break up your tasks into manageable chunks.
  • Upload these small, compressed files to S3 quickly (use gzip).
  • Greatly reduce your Redis usage by passing far less data over the wire.
  • Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
  • Greatly reduce memory consumption on your worker by processing the file one line at a time (see the worker sketch below).
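
On the worker side, the task could stream the gzipped file straight from S3 and insert the documents in small batches; the Mongo connection string, database, and collection names below are placeholders, and each line is assumed to be a JSON [key, value] pair as in the upload sketch above:

    import gzip
    import json

    import boto3
    from pymongo import MongoClient

    s3 = boto3.client("s3")
    # Placeholder connection details; use whatever the app already configures.
    collection = MongoClient("mongodb://localhost:27017")["mydb"]["items"]

    def import_chunk(bucket, key, batch_size=1000):
        """Stream the gzipped file from S3 line by line and insert the documents
        into Mongo in small batches, so the full chunk is never held in memory."""
        obj = s3.get_object(Bucket=bucket, Key=key)
        batch = []
        with gzip.open(obj["Body"], mode="rt", encoding="utf-8") as lines:
            for line in lines:
                k, v = json.loads(line)
                batch.append({"key": k, "value": v})
                if len(batch) >= batch_size:
                    collection.insert_many(batch)
                    batch = []
        if batch:
            collection.insert_many(batch)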

For use cases like yours, this will be MUCH faster and require much less overhead than sending the items themselves through your queueing system.

Hope this helps!

Answer

It turns out that the solution that worked for me was to save the data to Amazon S3 storage and then pass the URI to the function in the background task.