I know that this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way to accomplish this task.
I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:
import sys
import os
import hashlib
fileSliceLimitation = 5000000 #bytes
# if the file is big, use the slice trick to avoid loading the whole file into RAM
def getFileHashMD5(filename):
    retval = 0
    filesize = os.path.getsize(filename)
    if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
            m = hashlib.md5()
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
            retval = m.hexdigest()
    else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()
    return retval
searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""
text_file = open('outPut.txt', 'w')
for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5(fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)
# close txt file
text_file.close()
print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')
myListOfHashes = text_file.read()
if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname
This gives me the following output:
Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF
033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov
Reading outPut:
Match: /Users/bubble/Desktop/aF/simpsons.mov
My idea was:
1) Scan the directory
2) Write MD5 hashes + filenames to a text file
3) Open the text file as read-only
4) Scan the directory AGAIN and check against the text file...
I see that this isn't a good way of doing it AND it doesn't work. The 'match' just prints out the very last file that was processed.
How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task?
Thank you very much for any help. Sorry this is a long post.
The obvious tool for identifying duplicates is a hash table. Unless you are working with a very large number of files, you could do something like this:
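(A minimal sketch of that idea, assuming a get_file_hash(path) function -- discussed further down -- that returns some hashable key for a file, and the searchdirpath variable from the question.)

import os
from collections import defaultdict

file_dict = defaultdict(list)   # map: key (e.g. a hash) -> list of files that share it

for dirname, dirnames, filenames in os.walk(searchdirpath):
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        # files with the same key end up in the same list
        file_dict[get_file_hash(fullname)].append(fullname)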
At the end of this process, file_dict will contain a list for every unique hash; when two files have the same hash, they'll both appear in the list for that hash. Then filter the dict looking for value lists longer than 1, and compare the files to make sure they're really the same -- something like this:
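(A sketch of that filtering step, using filecmp from the standard library to confirm that candidate duplicates are byte-for-byte identical; file_dict is the dict built above.)

import filecmp

for hash_key, file_list in file_dict.items():
    if len(file_list) > 1:
        # same key: confirm the candidates really are identical
        first = file_list[0]
        for other in file_list[1:]:
            if filecmp.cmp(first, other, shallow=False):
                print('Match: ' + first + ' and ' + other)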
get_file_hash could use MD5s; or it could simply get the first and last bytes of the file as Ramchandra Apte suggested in the comments above; or it could simply use file sizes as tdelaney suggested in the comments above. Each of the latter two strategies is more likely to produce false positives, though. You could combine them to reduce the false positive rate (one such combination is sketched below).

If you're working with a very large number of files, you could use a more sophisticated data structure like a Bloom Filter.
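(A hypothetical sketch of such a cheap key function, combining the file size with the first and last bytes; the name and the probe_size parameter are my own, not from any library.)

import os

def get_file_hash(filename, probe_size=1024):
    # cheap key: file size plus a sample of bytes from each end of the file
    size = os.path.getsize(filename)
    with open(filename, 'rb') as fh:
        head = fh.read(probe_size)
        fh.seek(max(size - probe_size, 0))
        tail = fh.read(probe_size)
    return (size, head, tail)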