Deduplicating .pst files to find unique emails


I have what seems like a large task at hand.
I need to go through different archive volumes across multiple folders (we're talking terabytes of data). Within each folder is a .pst file. Some of these folders (and therefore files) may be exactly the same (same name or same data within the file). I want to be able to compare more than two files at once, if possible, to see whether any duplicates are found. Once the duplicates are found, I need to delete them, keep the originals, and eventually extract all the unique emails.

I know there are programs out there that can find duplicates, but I'm not sure what arguments I would need to pass them for these files, and I don't know if they can handle such large volumes of data. I'd like to program in either C# or VB. I'm at a loss on where to start. Any suggestions?

Ex...

m:\mail\name1\name.pst

m:\mail\name2\name.pst (same exact data as the one above)

m:\mail\name3\anothername.pst (duplicate file to the other 2)

There are 2 answers below.

Answer 1:

If you just want to remove entire duplicate files, the task is very simple to implement.

You will have to go through all your folders and hash the contents of each file. A hash is a fixed-size value of some number of bits (e.g. 32 to 256). If two file hashes are equal, there is an extremely high probability (depending on the collision resistance of your hash function, i.e. its number of bits) that the respective files are identical.

Of course, the implementation is now up to you (I am not a C# or VB programmer), but I would suggest something like the following pseudo-code (below I explain each step and how to approach it in C#):

    do {
        file_byte_array = get_file_contents_into_byte_array(file)    // 1
        hash = get_hash_from_byte_array(file_byte_array)              // 2
        if (hashtable_has_elem(hashtable, hash))                      // 3
            remove_file(file)                                         // 4
        else                                                          // 5
            hashtable_insert_elem(hashtable, hash, file)              // 6
    } while there_are_files_to_evaluate                               // 7

This logic should be executed over all of your .pst files. At step 1 (I assume you have the file open) you read the entire contents of the file into a byte array.

Once you have the byte array of your file, you must hash it using a hash function (step 2). You have plenty of hash function implementations to choose from. Some implementations require you to break the file into blocks and hash each block's contents; breaking the file into parts may be the only option if your files are really huge and do not fit in memory. Others accept a whole stream at once. If you have efficiency requirements, stay away from cryptographic hash functions: they are much heavier, and you do not need cryptographic properties to perform this task. A fast non-cryptographic function such as MurmurHash3 is a good fit.
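
For illustration, here is a minimal C# sketch of hashing a file in streaming fashion, so even a multi-gigabyte .pst never has to fit in memory. SHA-256 is used only because it ships with .NET; as noted above, a non-cryptographic hash such as MurmurHash3 (available as a third-party package) would be faster:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class FileHasher
    {
        // Hashes a file of any size in constant memory:
        // ComputeHash(Stream) reads and hashes the stream in blocks.
        public static string HashFile(string path)
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(path))
            {
                byte[] hash = sha.ComputeHash(stream);
                return BitConverter.ToString(hash); // hex string, usable as a key
            }
        }
    }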

Finally, after computing the hash you just need some way to save the hashes and compare them, in order to find identical hashes (read: identical files) and delete the extras (steps 3-6). I propose using a hash table or a dictionary, where the key (the value you use to perform lookups) is the file hash and the entry value is the File object.
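
Putting steps 1-7 together in C# might look like the following minimal sketch (it reuses the HashFile helper above; the actual File.Delete is left commented out so nothing is removed until you have verified the matches):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Key = file hash, value = path of the first file seen with that hash.
    var firstSeen = new Dictionary<string, string>();

    foreach (var file in Directory.EnumerateFiles(@"m:\mail", "*.pst",
                                                  SearchOption.AllDirectories))
    {
        string hash = FileHasher.HashFile(file);   // steps 1-2
        if (firstSeen.ContainsKey(hash))           // step 3
        {
            Console.WriteLine($"Duplicate: {file} (original: {firstSeen[hash]})");
            // File.Delete(file);                  // step 4: enable once verified
        }
        else
        {
            firstSeen.Add(hash, file);             // step 6
        }
    }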

Notes:

Remember: the more bits the hash value has, the lower the probability of collisions. If you want to know more about collision probabilities in hash functions, read about the birthday problem. You must pay attention to this topic since your objective is to delete files: if a collision occurs, you will delete a file that is not actually identical and lose it forever. There are many tactics to identify collisions, which you can combine and add to your algorithm (e.g. compare the sizes of your files, compare file contents at random positions, use more than one hash function). My advice would be to use all of these tactics. If you use two hash functions, then for two files to be considered identical the hash values from each hash function must be equal:

    file1, file2;
    file1_hash1 = hash_function1(file1);
    file2_hash1 = hash_function1(file2);
    file1_hash2 = hash_function2(file1);
    file2_hash2 = hash_function2(file2);

    if (file1_hash1 == file2_hash1 &&
        file1_hash2 == file2_hash2)
        // file1 is_duplicate_of file2;
    else
        // file1 is_NOT_duplicate_of file2;
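
In C#, the combined tactic (cheap size check first, then two independent hashes) could look like this sketch; SHA-256 and MD5 are used purely because both ship with .NET, not because cryptographic strength matters here:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class DuplicateCheck
    {
        // Files are only considered identical when their sizes and the
        // digests of two independent hash functions all agree.
        public static bool AreIdentical(string pathA, string pathB)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;                       // cheapest filter first

            return Hash(pathA, SHA256.Create()) == Hash(pathB, SHA256.Create())
                && Hash(pathA, MD5.Create())    == Hash(pathB, MD5.Create());
        }

        static string Hash(string path, HashAlgorithm algo)
        {
            using (algo)
            using (var stream = File.OpenRead(path))
                return BitConverter.ToString(algo.ComputeHash(stream));
        }
    }
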
Answer 2:

I would work through the process of finding duplicates by first recursively finding all of the .pst files, then matching on file length, then filtering by a fixed prefix of bytes, and finally performing a full hash or byte comparison to get actual matches.

Recursively building the list and finding potential matches can be as simple as this:

    // Recursive lambda: collect the *.pst files in this directory, then
    // recurse into every subdirectory.
    Func<DirectoryInfo, IEnumerable<FileInfo>> recurse = null;
    recurse = di => di.GetFiles("*.pst")
        .Concat(di.GetDirectories()
            .SelectMany(cdi => recurse(cdi)));

    // Group files by length and keep only the groups containing more than
    // one file -- those are the potential duplicates.
    var potentialMatches =
        recurse(new DirectoryInfo(@"m:\mail"))
            .ToLookup(fi => fi.Length)
            .Where(x => x.Skip(1).Any());

The potentialMatches query gives you a complete series of potential matches by file size.

I would then use the following functions (whose implementation I'll leave to you) to filter this list further.

    Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = /* your implementation */
    Func<FileInfo, FileInfo, bool> hashMatch = /* your implementation */
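
One possible sketch of those two filters (my own illustration, not the original answer's code; the streaming SHA-256 choice is an assumption):

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    // Compares the first prefixLength bytes of both files.
    Func<FileInfo, FileInfo, int, bool> prefixBytesMatch = (a, b, prefixLength) =>
    {
        using (var sa = a.OpenRead())
        using (var sb = b.OpenRead())
        {
            for (int i = 0; i < prefixLength; i++)
            {
                int byteA = sa.ReadByte(), byteB = sb.ReadByte();
                if (byteA != byteB) return false;  // differing byte (or lengths)
                if (byteA == -1) return true;      // both ended inside the prefix
            }
            return true;
        }
    };

    // Compares full-file SHA-256 digests, computed in streaming fashion.
    Func<FileInfo, FileInfo, bool> hashMatch = (a, b) =>
    {
        Func<FileInfo, byte[]> hash = fi =>
        {
            using (var sha = SHA256.Create())
            using (var stream = fi.OpenRead())
                return sha.ComputeHash(stream);
        };
        return hash(a).SequenceEqual(hash(b));
    };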

By limiting the matches by file length and then by a prefix of bytes, you will significantly reduce the computation of hashes required for your very large files.
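
Tying it all together might look like this (the 4096-byte prefix is an arbitrary choice, and comparing each file only against the first member of its size group is a simplification):

    foreach (var group in potentialMatches)
    {
        var original = group.First();              // keep the first file as the original
        foreach (var candidate in group.Skip(1))
        {
            if (prefixBytesMatch(original, candidate, 4096) &&
                hashMatch(original, candidate))
            {
                Console.WriteLine($"{candidate.FullName} duplicates {original.FullName}");
            }
        }
    }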

I hope this helps.