I want to compare a number of files and find out which of them are identical. They are not necessarily text files, so please don't suggest diff.
The files can be in any format (e.g. binary files).
I found out that I can run md5sum to compute the hash of each file and then compare the hashes manually to check whether the files are the same. But how can I automate this process?
PS: I also found that I can store the MD5 sums in a file using
md5sum <file-names> | cat >md5sum.txt
but I am stuck on how to automate this process.
I would prefer this to be done via a script (any language is fine).
If you can use a language like Perl or Python, with built-in support for hashes/dictionaries, it's really easy.
Loop over the file names, compute each file's signature, and build a hash with the MD5 sum as key and the list of files with that MD5 as value.
Then loop over the contents of the hash and show the entries with more than one item. These files are likely to be identical (you can't be completely sure with a signature-based approach, since two different files can in principle share the same signature).
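If you need certainty rather than likelihood, each group of candidates can be confirmed with a byte-by-byte comparison. Here is a minimal sketch, assuming Python's standard filecmp module; confirm_identical is a hypothetical helper name, and its input is one list of files that share a signature.

```python
import filecmp


def confirm_identical(paths):
    """Split candidate paths into groups whose contents are byte-for-byte equal."""
    groups = []
    for path in paths:
        for group in groups:
            # shallow=False forces a content comparison instead of
            # trusting file size and modification time alone.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append([path])
    return groups
```

Groups that still contain more than one file after this step really are identical.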
As people are asking for code, maybe something like below. That is a Perl implementation. I may add an equivalent Python sample later if it is wanted.
Say you put that in a file same.pl; you call it like:
perl same.pl
Example of use:
Below is a possible Python version (working with both Python 2 and Python 3).
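A minimal sketch along those lines, assuming the file names are given as command-line arguments. The script name same.py and the helper names file_md5 and find_duplicates are illustrative, not taken from the original, and the code sticks to features available in both Python 2.7 and Python 3.

```python
import hashlib
import sys


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary mode chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Map each MD5 digest shared by two or more files to the list of those files."""
    by_digest = {}
    for path in paths:
        by_digest.setdefault(file_md5(path), []).append(path)
    # Keep only the digests that more than one file produced.
    return {d: names for d, names in by_digest.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical invocation: python same.py file1 file2 ...
    for digest, names in sorted(find_duplicates(sys.argv[1:]).items()):
        print("%s  %s" % (digest, " ".join(names)))
```

Each output line shows one digest followed by the files that share it. Files with a unique digest are not printed.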
Note that if you are comparing a really large number of files, passing the file names on the command line as in the examples above may not work, because the shell command line will overflow. In that case you should supply them in some more elaborate way (or put a glob inside the script).
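Two ways of getting the names into a script without going through the argument list are sketched below, as one possible approach. The helper names names_from_glob and names_from_stream are hypothetical. The first expands a pattern inside the script; the second reads one file name per line, for example from the output of find piped into the script's standard input.

```python
import glob
import sys


def names_from_glob(pattern):
    """Expand the pattern inside the script, so the shell never builds a huge argument list."""
    return sorted(glob.glob(pattern))


def names_from_stream(stream):
    """Read one file name per line from a stream such as sys.stdin, skipping blank lines."""
    return [line.rstrip("\n") for line in stream if line.strip()]
```

Either list can then be fed to the same hashing loop as before. With names_from_stream(sys.stdin), the names travel through a pipe, which has no length limit of the kind the command line has.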