Python Multi-threading CPU workload


We tried to parallelize our program in Python using threads. The problem is that we don't get 100% CPU usage. All 8 cores are used, but only at around 50-60%, sometimes lower. Why doesn't the CPU run at full load during the calculation?

We are programming in Python on Windows.

Here is our implementation for the multithreading:

from threading import Thread
import hashlib

class CalculationThread(Thread):
    def __init__(self, target: str):
        Thread.__init__(self)
        self.target = target

    def run(self):
        for i in range(1000):
            hash_md5 = hashlib.md5()
            with open(str(self.target), "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            f = hash_md5.hexdigest()
        print(self.getName() + " finished")

threads = []
for i in range(20):
    t = CalculationThread(target="baden-wuerttemberg-latest.osm.pbf")
    print("Worker " + str(t.getName()) + " started")
    t.start()
    threads.append(t)

for t in threads:
    t.join()

CPU workload while running the calculation:

[screenshot]

1 Answer

Because of the Global Interpreter Lock (GIL), Python cannot achieve true parallelism across multiple cores with multithreading, especially for compute-intensive tasks.

You get some improvement because your task is also partly I/O-bound (you read the file from disk).

One way to figure out what your program is doing across multiple threads is to use a profiler with multi-thread support, such as VizTracer. It will show you how much time is spent in the MD5 calculation.
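
For reference, here is a minimal sketch of tracing the hashing code with VizTracer's context manager. The output_file argument and the vizviewer command follow VizTracer's documented usage, but treat the exact names as assumptions to check against your installed version:

import hashlib
from viztracer import VizTracer

def hash_file(path: str) -> str:
    # Same chunked MD5 hashing as in the question, pulled out into a function.
    hash_md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# Everything inside the block is traced; the report is written when the block exits.
with VizTracer(output_file="md5_trace.json"):
    hash_file("baden-wuerttemberg-latest.osm.pbf")

The resulting md5_trace.json can then be opened with vizviewer to see where the time goes.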

However, the correct way to get real parallelism is to use the multiprocessing library, probably a Pool, so that the work runs in multiple processes instead of multiple threads.
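
As a rough sketch (not code from the original answer), the same workload could look like this with a Pool. The file name and the 20 tasks of 1000 hashes each come from the question, while the choice of 8 processes is an assumption matching the 8 cores mentioned:

import hashlib
from multiprocessing import Pool

def hash_file_repeatedly(path: str) -> str:
    # Same work as the original run(): hash the file 1000 times in 4096-byte chunks.
    for _ in range(1000):
        hash_md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
    return hash_md5.hexdigest()

if __name__ == "__main__":  # the guard is required on Windows, where child processes are spawned
    target = "baden-wuerttemberg-latest.osm.pbf"
    with Pool(processes=8) as pool:  # assumption: 8 processes to match the 8 cores
        digests = pool.map(hash_file_repeatedly, [target] * 20)
    print("All workers finished")

Each task runs in its own process with its own interpreter and GIL, so the hashing can use all cores.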