how to efficiently traverse a directory and get the sha256 checksum for each file


I want to traverse any directory and be able to calculate the checksum of each file. Currently I am using Python multiprocessing with the following code:

import hashlib
import os
import time

from multiprocessing import Pool


def list_files(path):
    directories = []
    files = []

    def append_files(x):
        files.append(x)

    pool = Pool()

    src = os.path.abspath(os.path.expanduser(path))
    for root, dirs_o, files_o in os.walk(src):
        for name in dirs_o:
            directories.append(os.path.join(root, name))
        for name in files_o:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                pool.apply_async(
                    sha256_for_file,
                    args=(file_path,),
                    callback=append_files)

    pool.close()
    pool.join()

    return directories, files

def sha256_for_file(path, block_size=4096):
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path
    except IOError:
        return None, path

if __name__ == '__main__':
    start_time = time.time()

    d, f = list_files('~')
    print len(f)

    print '\n' + 'Elapsed time: ' + str(time.time() - start_time)      

The code uses apply_async. I also tried map and map_async, but saw no improvement in speed. I tried ThreadPool as well, but that was even slower.

from multiprocessing.pool import ThreadPool

pool = ThreadPool()
...
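
For reference, a map-based variant would first collect the file paths, then let the pool batch them. This is a sketch (the `hash_all` helper and the `chunksize` value are illustrative, not from the code above); `chunksize` is the main knob worth measuring, since it reduces per-task IPC overhead:

```python
import hashlib
from multiprocessing import Pool


def sha256_for_file(path, block_size=4096):
    # Same worker as above: hash one file in 4 KiB chunks.
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path
    except IOError:
        return None, path


def hash_all(paths):
    # chunksize batches several paths per task, cutting IPC overhead;
    # imap_unordered yields results as workers finish them.
    pool = Pool()
    try:
        return list(pool.imap_unordered(sha256_for_file, paths, chunksize=64))
    finally:
        pool.close()
        pool.join()
```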

Any ideas on how to optimize or improve the code so that it can traverse huge directories and calculate the checksum of every file using Python 2.7?

On a MacBook Pro (3GHz Intel Core i7, 16 GB RAM 1600 MHz DDR3, SSD disk) calculating the hash for all files (215658) in the user home '~' took: 194.71100688 seconds.

There are 2 answers below.


Let's have a closer look at the multithreading part. What does your program do?

  1. traverse directories
  2. open files and calculate their checksum

1 and 2 require concurrent disk access, while only 2 performs actual calculations. Using different threads for steps 1 and 2 wouldn't increase speed, because of this concurrent disk access. But 2 could be split into two distinct steps:

  1. traverse directories
  2. open files and read their contents
  3. calculate checksum of contents

1 and 2 could belong to one thread (disk access, writing to memory), while 3 could be performed in a separate one (reading memory, CPU calculation).

Still, I am not sure you would get a huge performance gain, as hash computation is generally not that CPU-intensive: most of the time might be spent on disk reads...
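
The split described above can be sketched with a bounded queue: one thread reads chunks from disk while the calling thread hashes them. The name `sha256_pipelined` and the queue size are illustrative, and this has not been benchmarked:

```python
import hashlib
import threading
try:
    import queue          # Python 3
except ImportError:
    import Queue as queue  # Python 2


def sha256_pipelined(path, block_size=1 << 20):
    # Bounded queue so the reader can't run far ahead of the hasher.
    chunks = queue.Queue(maxsize=8)

    def reader():
        with open(path, 'rb') as rf:
            while True:
                chunk = rf.read(block_size)
                chunks.put(chunk)
                if not chunk:  # b'' signals end of file
                    break

    t = threading.Thread(target=reader)
    t.start()

    h = hashlib.sha256()
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        h.update(chunk)
    t.join()
    return h.hexdigest()
```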


Try measuring the collective execution time of the function sha256_for_file.

If it is near 190 s, then this is the piece of code you should optimize or parallelize (reading chunks in one thread, calculating in a second thread).
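
With the existing Pool setup, one way to measure this is to have each worker report its own elapsed time alongside the digest, and sum those in the parent's callback. This is a sketch; the third return value is an addition, not part of the original function:

```python
import hashlib
import time


def sha256_for_file_timed(path, block_size=4096):
    # Same hashing as sha256_for_file, but also returns elapsed seconds,
    # so the parent process can accumulate per-file times across workers.
    start = time.time()
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path, time.time() - start
    except IOError:
        return None, path, time.time() - start

# In the apply_async callback, accumulate: total_hash_time += result[2]
```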