Is it feasible to unpickle large amounts of data in a Python web app that needs to be scalable?


I am working on my first web application. I am doing it in Python with Flask, and I want to run a piece of Python code that loads an object, with pickle, from a (binary) file of several MB which contains all the necessary data. I have been told that using pickle in a web app is not a good idea because of scalability; why?

Obviously, for a given purpose it's better to fetch just the necessary data. However, when I do this with an Elasticsearch database, even in the fastest way I can manage, the whole process takes about 100 times longer than reading all the data at once from the binary file. Once the binary is unpickled, which takes at most one second, the computations are very fast. So I am wondering whether I should use the binary file, and if so, how to do it in a scalable way.

2 Answers

BEST ANSWER

So this is something I also have to deal with. In my case it's even worse: mine can be hundreds of MB or more.

My first questions would be:

  • Does the pickled data change and if so, where / by whom can it be changed?

  • Are there multiple sets of this data that you need? (In my case I have thousands of them, needed at different times by different people.)

  • How often is it needed?

The answers to those questions suggest quite different ways of approaching it.

Assuming you just have a big blob of data that's needed on a bunch of requests and is the same for everybody, and you know you'll need it soon enough, I'd either load it when the app starts and keep it in memory, or lazy-load it when it's first requested (and then keep it in memory).
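
For the lazy-load option, here's a minimal sketch (the file path, the module-level cache and the /compute route are placeholders I've made up, not part of the original question):

# sketch: unpickle the blob once per process, on first use, and cache it in memory
import pickle

from flask import Flask, jsonify

DATA_PATH = 'big_data.pkl'  # placeholder path to your pickled file

app = Flask(__name__)
_big_data = None  # module-level cache, empty until the first request needs it

def get_big_data():
    # unpickle on the first call, then serve the cached object from memory
    global _big_data
    if _big_data is None:
        with open(DATA_PATH, 'rb') as f:
            _big_data = pickle.load(f)
    return _big_data

@app.route('/compute')
def compute():
    data = get_big_data()  # slow only on the very first request per process
    # ... do your real computations with `data` here ...
    return jsonify(loaded=data is not None)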

The alternative approach is to split the heavy data bit into its own flask app.

# api.py: your api flask application
import pickle

from flask import Flask, jsonify, request

api_app = Flask(__name__)

# load the big pickled object once, when this process starts
# ('big_gis_object.pkl' is a placeholder path for wherever your pickle file lives)
with open('big_gis_object.pkl', 'rb') as f:
    big_gis_object = pickle.load(f)

@api_app.route('/find_distance')
def find_distance():
    # grab the parameters for this request; query-string values arrive as strings,
    # so convert them to floats before doing geo maths with them
    lat = float(request.args['lat'])
    lon = float(request.args['lon'])

    # do your normal geo calculations here
    distance = big_gis_object.do_dist_calcs(lat, lon)

    # return the result as json to make things easy
    return jsonify(distance=distance)


# app.py: your main flask application
import requests
from flask import Flask, render_template

main_app = Flask(__name__)

@main_app.route('/')
def homepage():
    # this is how you ask the geo api app to do something for you
    # note that we're using the requests library to make it easier
    # - http://docs.python-requests.org/en/latest/user/quickstart/
    resp = requests.get('http://url_to_flask_app/find_distance', params=dict(lat=1.5, lon=1.7))
    distance = resp.json()['distance']
    return render_template('homepage.html', distance=distance)

How you then provision those will depend on the load and your requirements, but it's flexible. You can have, say, 40 processes for your main front-end app and just 1 api process (though it will only be able to do one thing at a time). If you need more api processes, just scale that side up until you have the right balance. The trade-off is that the api processes need more memory.

Does that all make sense?

ANOTHER ANSWER

Pickle files are a good way to load data in Python. If you're on Python 2, consider using the C implementation, cPickle.
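
For example, a common import pattern that falls back gracefully (on Python 3 the standard pickle module already uses the C implementation under the hood):

try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically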

If you want to scale while using pickle files, ideally you should look for a partition key that fits your data and your project's needs.

Let's say, for example, you have historical data: instead of having a single file with all of the historical data, you can create one pickle file per date.
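
A minimal sketch of that idea (the data/ directory layout and the file naming here are my own assumptions, not from the answer); each request then unpickles only the one small partition it actually needs:

# sketch: one pickle file per date, so a request loads only its own partition
import pickle
from datetime import date

def load_partition(day):
    # e.g. data/2015-06-01.pkl -- the path layout is just an illustration
    path = 'data/{}.pkl'.format(day.isoformat())
    with open(path, 'rb') as f:
        return pickle.load(f)

# usage: a request about a single day touches a single small file
records = load_partition(date(2015, 6, 1))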