I want to traverse any directory and be able to calculate the checksum of each file. Currently I am using Python multiprocessing with the following code:
import hashlib
import os
import time
from multiprocessing import Pool


def list_files(path):
    directories = []
    files = []

    def append_files(x):
        # Callback, run in the parent process with each (digest, path) result.
        files.append(x)

    pool = Pool()
    src = os.path.abspath(os.path.expanduser(path))
    for root, dirs_o, files_o in os.walk(src):
        for name in dirs_o:
            directories.append(os.path.join(root, name))
        for name in files_o:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                pool.apply_async(
                    sha256_for_file,
                    args=(file_path,),
                    callback=append_files)
    pool.close()
    pool.join()
    return directories, files


def sha256_for_file(path, block_size=4096):
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            # Read in fixed-size chunks so large files don't fill memory.
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
            return h.hexdigest(), path
    except IOError:
        return None, path


if __name__ == '__main__':
    start_time = time.time()
    d, f = list_files('~')
    print len(f)
    print '\n' + 'Elapsed time: ' + str(time.time() - start_time)
The code uses apply_async. I also tried map and map_async, but saw no improvement in speed (a sketch of the map variant is shown after the snippet below). Using a ThreadPool instead was even slower:
from multiprocessing.pool import ThreadPool
pool = ThreadPool()
...
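For reference, here is a minimal sketch of what the map variant could look like (a hypothetical reconstruction, reusing sha256_for_file from above; note that pool.map needs the complete list of paths up front, so the directory walk can no longer overlap with the hashing the way it does with apply_async):

from multiprocessing import Pool
import os

file_paths = []
src = os.path.abspath(os.path.expanduser('~'))
for root, dirs_o, files_o in os.walk(src):
    for name in files_o:
        file_path = os.path.join(root, name)
        if os.path.isfile(file_path):
            file_paths.append(file_path)

pool = Pool()
# Blocks until every result is ready; results come back in input order.
results = pool.map(sha256_for_file, file_paths)
pool.close()
pool.join()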
Any ideas on how to optimize or improve the code so that it can traverse huge directories and calculate the checksum of every file using Python 2.7?
On a MacBook Pro (3 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, SSD), calculating the hash of all 215,658 files in the user home '~' took 194.71100688 seconds.
Let's have a closer look at the multithreading part. What does your program do?

1. traverse the directory tree
2. open each file, read its contents from disk, and calculate the hash

Steps 1 and 2 require concurrent disk access, while only step 2 performs actual calculations. Using different threads for steps 1 and 2 wouldn't increase speed, because of this concurrent disk access. But step 2 could be split into two distinct steps:

1. traverse the directory tree
2. read a file's contents from disk into memory
3. calculate the hash of the data in memory

Steps 1 and 2 could then belong to one thread (disk access, writing to memory), while step 3 could be performed in a separate one (reading memory, CPU calculation).
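Here is a minimal, hypothetical sketch of that split for Python 2.7 (the function names, the bounded queue, and the per-file end markers are my own choices, not something from the question's code): one thread walks the tree and reads chunks from disk, a second thread does the CPU work of hashing.

import hashlib
import os
from Queue import Queue      # Python 2.7; renamed to "queue" in Python 3
from threading import Thread


def read_chunks(src, chunk_queue, block_size=1024 * 1024):
    # Steps 1 and 2: walk the tree and read file contents into memory.
    for root, _, names in os.walk(src):
        for name in names:
            path = os.path.join(root, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path, 'rb') as rf:
                    for chunk in iter(lambda: rf.read(block_size), b''):
                        chunk_queue.put((path, chunk))
                chunk_queue.put((path, None))   # end-of-file marker
            except IOError:
                pass
    chunk_queue.put(None)                       # end-of-walk sentinel


def hash_chunks(chunk_queue, results):
    # Step 3: pure CPU work -- consume chunks, update per-file hashes.
    in_progress = {}
    while True:
        item = chunk_queue.get()
        if item is None:
            break
        path, chunk = item
        if chunk is None:
            # Empty files never enqueue any data, hence the default.
            h = in_progress.pop(path, hashlib.sha256())
            results.append((h.hexdigest(), path))
        else:
            in_progress.setdefault(path, hashlib.sha256()).update(chunk)


results = []
chunk_queue = Queue(maxsize=64)  # bounded, so the reader cannot outrun the hasher
hasher = Thread(target=hash_chunks, args=(chunk_queue, results))
hasher.start()
read_chunks(os.path.expanduser('~'), chunk_queue)
hasher.join()
print len(results)

The queue is bounded so memory use stays flat, and since CPython's hashlib releases the GIL while hashing buffers larger than a couple of kilobytes, the hashing thread can genuinely overlap with the disk reads.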
Still, I am not sure you would get a huge performance gain: hash computation is generally not that CPU-intensive, and most of the elapsed time is probably spent reading from disk...
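One hypothetical way to check where the time goes is to hash a buffer that is already in memory, taking the disk out of the picture, and compare the implied throughput against the measured run:

import hashlib
import os
import time

data = os.urandom(256 * 1024 * 1024)  # 256 MB already in memory
start = time.time()
hashlib.sha256(data).hexdigest()
elapsed = time.time() - start
print 'CPU-only SHA-256: %.1f MB/s' % (256 / elapsed)

If the CPU-only throughput is far above what the 194-second run implies for the data volume on disk, the job is I/O-bound and more parallelism in the hashing won't help much.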