Problem Statement
I'm currently building an exchange scraper with three tasks, each running on its own process:
- #1: Receive a live web feed: very fast data comes in, which I immediately put in a `multiprocessing.Queue` and continue.
- #2: Consume queue data and optimize: I consume and optimize the data using some logic I wrote. It's slow, but not too slow; it eventually catches up and clears the queue when incoming data slows down.
- #3: Compress the feed using `bz2` and upload to my S3 bucket: every hour, I compress the optimized data (to reduce file size even more) and then upload it to my S3 bucket. This takes about 10-20 seconds on my machine.
The problem I'm having is that each of these tasks needs its own parallel process. The producer (#1) can't do the optimization (#2), otherwise it stalls the feed connection and the website kills my socket because task #1 stops responding. The uploader (#3) can't run on the same process as task #2, otherwise the queue fills up too much and I can never catch up. I've tried combining them: it doesn't work.
This scraper works just fine on my local machine with each task on its own process, but I really don't want to spend a lot of money on a 3-core machine when this is deployed to a server. I found DigitalOcean's 4 vCPU option is the cheapest, at $40/mo, but I was wondering whether there's a better way than paying for four cores.
Just some stuff to note: on my base 16" MBP, Task #1 uses 99% CPU, Task #2 uses 20-30% CPU, and Task #3 sleeps until the turn of the hour, so it mostly uses 0.5-1% CPU.
Questions:
1. If I run three processes on a 2-core machine, is that effectively the same as running two processes? I know it depends on system scheduling, but does that mean it will stall on compression, or keep moving until compression is over? It seems really wasteful to spin up (and pay for) an entire extra core that is only used once an hour, but that hourly task stalls the queue too much and I'm not sure how to get around that.
2. Is there any way I can continue Task #2 while I compress my files on the same process/core?
3. If I run a bash script to do the compression instead, would that still stall the software? My computer is 6-core, so I can't really test the server's constraint locally.
4. Are there any cheaper alternatives to DigitalOcean? I'm honestly terrified of AWS because I've heard horror stories of people getting $1,000 bills for unexpected usage. I'd rather have something more predictable like DigitalOcean.
What I've Tried
As mentioned before, I tried combining Task #2 and Task #3 in the same process, and it ends up stalling once the compression begins. The compression is synchronous and done using the code from this thread. I couldn't find an asynchronous bz2 compressor, and I'm not sure one would even keep Task #2 from stalling.
PS: I really tried to avoid coming to StackOverflow with an open-ended question like this because I know these get bad feedback, but the alternative is experimenting with a lot of time and money on the line when, to be honest, I don't know much about cloud computing. I'd prefer some expert opinions.
bullet point #1:
All operating systems you'll run into use preemptive scheduling to switch between processes. On any remotely modern hardware, this should guarantee each process gets resumed at least several times a second (as long as the process is using the CPU and not waiting on an interrupt like file or socket I/O). Basically, it's not a problem at all to run even hundreds of processes on a 2-core CPU. If the total load is too much, everything will run slower, but nothing should completely stall.
bullet point #2:
Multithreading? You may find compressing/storing to be more I/O-limited, so a thread would probably be fine here. You may even see a benefit from reduced overhead in transferring data between processes (depending on how you currently do it), as a child thread has full access to the memory space of the parent.
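For instance, a minimal sketch of handing the hourly job to a background thread (`upload_to_s3` is a hypothetical stand-in for your S3 client code; CPython's `bz2` releases the GIL while compressing, so the consumer loop should keep making progress):

```python
import bz2
import threading

def compress_and_upload(path):
    # Read the hour's optimized data, compress it, and upload it.
    with open(path, "rb") as f:
        compressed = bz2.compress(f.read())
    out_path = path + ".bz2"
    with open(out_path, "wb") as f:
        f.write(compressed)
    upload_to_s3(out_path)  # hypothetical helper wrapping your S3 client

def schedule_upload(path):
    # Fire-and-forget: returns immediately so the queue keeps draining.
    t = threading.Thread(target=compress_and_upload, args=(path,), daemon=True)
    t.start()
    return t
```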
bullet point #3:
A shell script is just another process, so the answer isn't too different from #1. Do test this, however, as Python's bz2 may well be much slower than the shell's bzip2 (depending on how you feed it data and where it's trying to put the output)...
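You don't even need a bash script for that; a non-blocking `subprocess.Popen` call gives you the same separate process (the flag here is illustrative):

```python
import subprocess

def compress_in_background(path):
    # Launch the system bzip2 as its own process and return immediately;
    # the OS scheduler time-slices it alongside the Python processes
    # (see answer #1). bzip2 replaces `path` with `path.bz2` in place.
    return subprocess.Popen(["bzip2", "-9", path])

# usage: proc = compress_in_background(path)
# proc.poll() returns None while bzip2 is still running, its exit code once done.
```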
bullet point #4:
Definitely not an appropriate question for S.O.
My recommendation:
Profile your code... Make the ingest process as efficient as possible, and send as little data between processes as possible. A process that merely reads data from a socket and hands it off for processing should need minimal CPU. The default `multiprocessing.Queue` isn't terribly efficient, because it pickles the data, sends it through a pipe, then unpickles it at the other end. If your data can be split into fixed-size chunks, consider using a couple of `multiprocessing.shared_memory.SharedMemory` buffers to swap between. Chunking the data stream should also make it easier to parallelize the data consumption stage to better utilize whatever CPU resources you have.

edit: a pseudocode-ish example of sending chunks of data via shared memory:
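(A sketch assuming Python 3.8+, where `multiprocessing.shared_memory` was added; `read_from_feed` and `optimize` are hypothetical stand-ins for your feed reader and optimization logic.)

```python
import multiprocessing as mp
from multiprocessing import shared_memory

CHUNK_SIZE = 2 ** 20   # fixed-size 1 MiB chunks (illustrative)
NUM_BUFFERS = 4        # a couple of buffers to swap between

def producer(free_q, filled_q):
    while True:
        data = read_from_feed()              # hypothetical: returns <= CHUNK_SIZE bytes
        name = free_q.get()                  # wait for a free buffer
        shm = shared_memory.SharedMemory(name=name)
        shm.buf[:len(data)] = data           # one copy into shared memory
        shm.close()
        filled_q.put((name, len(data)))      # queue carries only a name and a length

def consumer(free_q, filled_q):
    while True:
        name, n = filled_q.get()
        shm = shared_memory.SharedMemory(name=name)
        chunk = bytes(shm.buf[:n])           # copy out, then recycle the buffer
        shm.close()
        free_q.put(name)
        optimize(chunk)                      # hypothetical optimization logic

if __name__ == "__main__":
    free_q, filled_q = mp.Queue(), mp.Queue()
    buffers = [shared_memory.SharedMemory(create=True, size=CHUNK_SIZE)
               for _ in range(NUM_BUFFERS)]
    for shm in buffers:
        free_q.put(shm.name)                 # every buffer starts out free
    workers = [mp.Process(target=producer, args=(free_q, filled_q)),
               mp.Process(target=consumer, args=(free_q, filled_q))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Only the buffer name and a byte count cross the queue, so the pickling overhead stays constant no matter how big each chunk is.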