Efficient way of iterating through large amounts of data in Python

I'm trying to run a dictionary attack on a sha512 hash. I know the hash is made up of two words, all lowercase, separated by a space. The words are from a known dictionary (02-dictionary.txt), which contains 172,820 words. Currently, my code is the following:

import hashlib
import sys
import time

def crack_hash(word, target):
    dict_hash = hashlib.sha512(word.encode())
    if dict_hash.hexdigest() == target:
        return (True, word)
    else:
        return (False, None)

if __name__ == "__main__":
    target_hash = sys.argv[1].strip()
    
    fp = open("02-dictionary.txt", "r")

    words = []
    start_time = time.time()
    for word in fp:
        words.append(word)
    fp.close()

    for word1 in words:
        for word2 in words:
            big_word = word1.strip() + " " + word2.strip()
            print(big_word)
            soln_found, soln_word = crack_hash(big_word.strip(), target_hash)
            if soln_found:
                print('Solution found')
                print("The word was:", soln_word)
                break

    end_time = time.time()
    total_time = end_time - start_time
    print("Time taken:", round(total_time, 5), "seconds")

However, when I run this code, the program runs incredibly slowly. I know Python isn't the most efficient language, but I'm guessing the issue stems more from my choice of data structure. Is there a more efficient data structure? I tried the array module, but its documentation makes it seem designed for primitive types (ints, floats, shorts, bools, chars, etc.) rather than more complicated types like strings (or lists of chars). What is the best way to improve this code? In about an hour of runtime, I only got through around 1% of all the possible word combinations.
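
For a rough sense of scale (a back-of-the-envelope estimate from the figures above; the 1%-per-hour rate is just my observed progress):

# Rough size of the search space, using the dictionary size quoted above.
num_words = 172_820
combinations = num_words ** 2        # ordered two-word phrases
print(f"{combinations:,}")           # 29,866,752,400
# At roughly 1% covered per hour, a full run would be on the order of 100 hours.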

1 Answer

Best Answer

The problem is that you're computing 178,000² = 31,684,000,000 (about 2³⁵) hashes. That's a lot of work. I've made a few changes to achieve some optimizations in pure Python, but I suspect that the overhead of the hashlib calls is pretty significant. I think doing this all in native code would result in a more dramatic speedup.

The optimizations include the following:

  • Pre-encoding the dictionary words as bytes objects
  • Precomputing the partial hash of the first word plus the separating space, and reusing it for every second word
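
The second point works because hashlib hash objects can be copied mid-stream: the state after hashing word1 plus the space can be reused for every word2. A minimal standalone check of that idea (illustrative words, not part of the script below):

import hashlib

# Hashing the whole phrase in one go...
full = hashlib.sha512(b"hello world").digest()

# ...matches hashing the prefix once, then copying the partial state
# and feeding it only the second word.
prefix = hashlib.sha512(b"hello ")
clone = prefix.copy()
clone.update(b"world")
assert clone.digest() == full

With that in mind, the full version: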
import hashlib
import sys
import time


def try_all(words, target_hash):
    for word1 in words:
        hash_prefix = hashlib.sha512(word1 + b' ')  # partial hash of word1 plus the space
        for word2 in words:
            prefix_copy = hash_prefix.copy()  # reuse the precomputed partial state
            prefix_copy.update(word2)         # finish the hash with the second word
            if prefix_copy.digest() == target_hash:
                print('Solution found')
                big_word = (word1 + b' ' + word2).decode('utf8')
                print(f'The word was: {big_word}')
                return


def read_all_words(filename):
    with open(filename, "rt") as f:
        return [line.strip().encode('utf-8') for line in f]


def get_test_hash(words):
    phrase = words[-2] + b' ' + words[-1]  # pick target towards end
    return hashlib.sha512(phrase).digest()


if __name__ == "__main__":
    words = read_all_words("02-dictionary.txt")
    TESTING = True
    if TESTING:
        words = words[:5000]  # reduce the size of the word list for testing only
        target_hash = get_test_hash(words)
    else:
        target_hash = bytes.fromhex(sys.argv[1].strip())
    start_time = time.time()
    try_all(words, target_hash)
    end_time = time.time()
    total_time = end_time - start_time
    print(f"Time taken: {round(total_time, 5)} seconds")
    print(f'{total_time / pow(len(words), 2)} seconds per hash')

On my laptop this runs at about 1.1 × 10⁻⁶ seconds per hash, so trying all the words in the dictionary would take a little under 10 hours of CPU time.
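
That figure is consistent with the hash count above (a quick check, using the rounded 178,000-word figure):

hashes = 178_000 ** 2                  # 31,684,000,000 two-word combinations
seconds = hashes * 1.1e-6              # measured per-hash cost from above
print(f"{seconds / 3600:.1f} hours")   # ~9.7 hours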