I am working on a web backend / API provider that grabs realtime data from a 3rd party web API, puts it in a MySQL database and makes it available over an HTTP/JSON API.
I am providing the API with Flask and working with the DB using SQLAlchemy Core.
For the realtime data grabbing part, I have functions that wrap the 3rd party API by sending a request, parsing the returned XML into a Python dict and returning it. We'll call these API wrappers.
I then call these functions within other methods which take the respective data, do any processing if needed (like time zone conversions etc.) and put it in the DB. We'll call these processors.
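To give an idea of the current structure, a stripped-down wrapper/processor pair looks roughly like this (the URL, table name and column names are just placeholders):

```python
import urllib2
from xml.etree import ElementTree

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine("mysql://user:pass@localhost/mydb")
metadata = MetaData()
readings = Table("readings", metadata, autoload=True, autoload_with=engine)

def fetch_readings(feed_url):
    """API wrapper: request the 3rd party endpoint and parse the XML into a dict."""
    xml = urllib2.urlopen(feed_url, timeout=10).read()
    root = ElementTree.fromstring(xml)
    return dict((child.tag, child.text) for child in root)

def process_readings(feed_url):
    """Processor: call the wrapper, massage the data, insert it into MySQL."""
    data = fetch_readings(feed_url)
    # ... time zone conversions etc. happen here ...
    engine.execute(readings.insert().values(**data))
```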
I've been reading about asynchronous I/O and eventlet specifically and I'm very impressed.
I'm going to incorporate it in my data grabbing code, but I have some questions first:
- Is it safe for me to monkey patch everything? Considering I have Flask, SQLAlchemy and a bunch of other libs, are there any downsides to monkey patching (assuming there is no late binding)?
- What granularity should I divide my tasks into? I was thinking of creating a pool that periodically spawns processors. Then, once a processor reaches the part where it calls the API wrappers, the wrappers will start a GreenPile to get the actual HTTP data using eventlet.green.urllib2. Is this a good approach? (There's a rough sketch of what I have in mind after this list.)
- Timeouts: I want to make sure no greenthread ever hangs. Is it a good approach to set eventlet.Timeout to 10-15 seconds for every greenthread?
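Here's the rough sketch (the feed URLs and pool sizes are placeholders I made up):

```python
import eventlet
eventlet.monkey_patch()  # question 1: safe to patch everything with Flask/SQLAlchemy around?

from eventlet.green import urllib2

FEEDS = ["http://thirdparty.example.com/feed/%d" % i for i in range(10)]

def api_wrapper(url):
    with eventlet.Timeout(15):  # question 3: per-greenthread timeout so nothing hangs
        return urllib2.urlopen(url).read()  # would be parsed into a dict here

def processor():
    pile = eventlet.GreenPile()  # question 2: fan out the HTTP calls from inside the processor
    for url in FEEDS:
        pile.spawn(api_wrapper, url)
    for result in pile:
        pass  # time zone conversions etc., then insert into MySQL

pool = eventlet.GreenPool()
while True:
    pool.spawn_n(processor)
    eventlet.sleep(10)  # one processor every ~5-10 seconds
```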
FYI, I have about 10 different sets of realtime data, and a processor is spawned every ~5-10 seconds.
Thanks!
I don't think it's wise to mix Flask/SQLAlchemy with an asynchronous (or event-driven) programming model.
However, since you state that you are using an RDBMS (MySQL) as intermediary storage, why don't you just create asynchronous workers that store the results from your third-party web services in the RDBMS, and keep your frontend (Flask/SQLAlchemy) synchronous?
In that case you don't need to monkeypatch Flask or SQLAlchemy.
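A minimal sketch of that split (the feed URLs, table and credentials are placeholders, so adapt as needed): the worker below runs as its own process and is the only thing that gets monkey patched, while the Flask/SQLAlchemy frontend keeps reading from MySQL exactly as it does now.

```python
# worker.py -- run as a separate process from the Flask app
import eventlet
eventlet.monkey_patch()  # only this worker process is patched

from eventlet.green import urllib2
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:pass@localhost/mydb")
FEEDS = ["http://thirdparty.example.com/feed/%d" % i for i in range(10)]

def fetch_and_store(url):
    with eventlet.Timeout(15):  # make sure no greenthread hangs forever
        xml = urllib2.urlopen(url).read()
    # parse the XML and do the time zone conversions here, then store:
    engine.execute(text("INSERT INTO readings (raw_xml) VALUES (:xml)"), xml=xml)

pool = eventlet.GreenPool(10)
while True:
    for url in FEEDS:
        pool.spawn_n(fetch_and_store, url)
    pool.waitall()
    eventlet.sleep(10)  # matches the ~5-10 second refresh cycle
```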
Regarding granularity, you may want to use the MapReduce paradigm to perform the web API calls and processing. This pattern may give you some idea of how to logically separate the consecutive steps and how to control the processes involved.
Personally, I wouldn't use an asynchronous framework for this, though. It may be better to use either multiprocessing, Celery, or a real MapReduce-style system like Hadoop.
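For example, with the multiprocessing module the map step (one web API call per feed) and the reduce step (post-processing and storage) separate quite naturally; the URLs and parsing below are only placeholders:

```python
from multiprocessing import Pool
import urllib2
from xml.etree import ElementTree

FEEDS = ["http://thirdparty.example.com/feed/%d" % i for i in range(10)]

def fetch_and_parse(url):
    """Map step: one web API call, parsed into a plain dict."""
    xml = urllib2.urlopen(url, timeout=15).read()
    root = ElementTree.fromstring(xml)
    return dict((child.tag, child.text) for child in root)

def store(records):
    """Reduce step: post-process and write everything to MySQL in one place."""
    for record in records:
        pass  # time zone conversion + INSERT via SQLAlchemy Core

if __name__ == "__main__":
    workers = Pool(processes=4)
    store(workers.map(fetch_and_parse, FEEDS))  # map across the feeds, then reduce/store
```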
Just a hint: start small, keep it simple and modular, and optimize later if you need better performance. This will also be heavily influenced by how real-time you want the information to be.