how to efficiently traverse a directory and get the sha256 checksum for each file


I want to traverse any directory and be able to calculate the checksum of each file. Currently I am using Python multiprocessing with the following code:

import hashlib
import os
import time

from multiprocessing import Pool


def list_files(path):
    directories = []
    files = []

    def append_files(x):
        files.append(x)

    pool = Pool()

    src = os.path.abspath(os.path.expanduser(path))
    for root, dirs_o, files_o in os.walk(src):
        for name in dirs_o:
            directories.append(os.path.join(root, name))
        for name in files_o:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                pool.apply_async(
                    sha256_for_file,
                    args=(file_path,),
                    callback=append_files)

    pool.close()
    pool.join()

    return directories, files

def sha256_for_file(path, block_size=4096):
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path
    except IOError:
        return None, path

if __name__ == '__main__':
    start_time = time.time()

    d, f = list_files('~')
    print len(f)

    print '\n' + 'Elapsed time: ' + str(time.time() - start_time)      

The code uses apply_async. I also tried map and map_async, but saw no improvement in speed. I tried ThreadPool as well, but that was even slower.

from multiprocessing.pool import ThreadPool

pool = ThreadPool()
...
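
For reference, a map-based variant would first collect the file paths, then let the pool batch them. This is a sketch (the `hash_all` helper and the `chunksize` value are illustrative, not from the code above); `chunksize` is the main knob worth measuring, since it reduces per-task IPC overhead:

```python
import hashlib
from multiprocessing import Pool


def sha256_for_file(path, block_size=4096):
    # Same worker as above: hash one file in 4 KiB chunks.
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path
    except IOError:
        return None, path


def hash_all(paths):
    # chunksize batches several paths per task, cutting IPC overhead;
    # imap_unordered yields results as workers finish them.
    pool = Pool()
    try:
        return list(pool.imap_unordered(sha256_for_file, paths, chunksize=64))
    finally:
        pool.close()
        pool.join()
```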

Any ideas on how to optimize or improve the code so that it can traverse huge directories and calculate the checksum of every file using Python 2.7?

On a MacBook Pro (3GHz Intel Core i7, 16 GB RAM 1600 MHz DDR3, SSD disk) calculating the hash for all files (215658) in the user home '~' took: 194.71100688 seconds.

There are 2 answers below.


Let's have a closer look at the multithreading part. What does your program do?

  1. traverse directories
  2. open files and calculate their checksum

1 and 2 require concurrent disk access, while only 2 performs actual calculations. Using different threads for steps 1 and 2 wouldn't increase speed, because of this concurrent disk access. But 2 could be split into two distinct steps:

  1. traverse directories
  2. open files and read their contents
  3. calculate checksum of contents

1 and 2 could belong to one thread (disk access, writing to memory), while 3 could be performed in a separate one (reading memory, CPU calculation).

Still, I am not sure you would get a huge performance gain, as hash computation is generally not that CPU-intensive: most of the time might be spent on disk reads...
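
The split described above can be sketched with a bounded queue: one thread reads chunks from disk while the calling thread hashes them. The name `sha256_pipelined` and the queue size are illustrative, and this has not been benchmarked:

```python
import hashlib
import threading
try:
    import queue          # Python 3
except ImportError:
    import Queue as queue  # Python 2


def sha256_pipelined(path, block_size=1 << 20):
    # Bounded queue so the reader can't run far ahead of the hasher.
    chunks = queue.Queue(maxsize=8)

    def reader():
        with open(path, 'rb') as rf:
            while True:
                chunk = rf.read(block_size)
                chunks.put(chunk)
                if not chunk:  # b'' signals end of file
                    break

    t = threading.Thread(target=reader)
    t.start()

    h = hashlib.sha256()
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        h.update(chunk)
    t.join()
    return h.hexdigest()
```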


Try measuring the collective execution time of the function sha256_for_file.

If it is near 190 s, then this is the piece of code you should optimize or parallelize (reading chunks in one thread, calculating in a second thread).
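
With the existing Pool setup, one way to measure this is to have each worker report its own elapsed time alongside the digest, and sum those in the parent's callback. This is a sketch; the third return value is an addition, not part of the original function:

```python
import hashlib
import time


def sha256_for_file_timed(path, block_size=4096):
    # Same hashing as sha256_for_file, but also returns elapsed seconds,
    # so the parent process can accumulate per-file times across workers.
    start = time.time()
    try:
        with open(path, 'rb') as rf:
            h = hashlib.sha256()
            for chunk in iter(lambda: rf.read(block_size), b''):
                h.update(chunk)
        return h.hexdigest(), path, time.time() - start
    except IOError:
        return None, path, time.time() - start

# In the apply_async callback, accumulate: total_hash_time += result[2]
```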