Trying to determine if file has been uuencoded


I am trying to process a large collection of txt files that are themselves containers for the actual files I want to process. The txt files have SGML tags that set boundaries for the individual files. Sometimes the contained files are binary files that have been uuencoded. I have solved the problem of decoding the uuencoded files, but as I mulled over my solution I realized that it is not general enough. That is, I have been using

if '\nbegin 644 ' in document['document']:

to test if the file is uuencoded. I did some searching and have a vague understanding of what the 644 means (file permissions) and have then found other examples of uuencoded files that might have

if '\nbegin 642 ' in document['document']:

or even some other variant. Thus, my problem is: how do I make sure that I identify all of the subcontainers that hold uuencoded files?

One solution is to test every subcontainer:

import codecs

uudecode = codecs.getdecoder("uu")

for document in documents:
    try:
        # the uu codec raises ValueError when it finds no "begin" line
        decoded_document, m = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        pass  # more stuff

This is not horrible, since CPU cycles are cheap, but I am going to be handling some 8 million documents.

Thus, is there a more robust way to recognize whether or not a particular string is the result of uuencoding?


There are 2 best solutions below

Best answer

Wikipedia says that every uuencoded file begins with this line

begin <perm> <name>

So a line matching the regexp ^begin [0-7]{3} (.*)$ probably marks the start of a uuencoded block reliably enough.
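A minimal Python sketch of that header test, for checking a whole subcontainer string at once. The names `UU_BEGIN` and `find_uu_header` are made up for illustration, and the pattern keeps the three-octal-digit mode from the regexp above; if your data contains other mode widths, the pattern would need widening. `re.MULTILINE` is what lets `^` and `$` match at line boundaries inside the larger document string.

```python
import re

# Mode-agnostic header test based on the regexp suggested above.
# Assumption: the mode is exactly three octal digits, as in that regexp.
# re.MULTILINE makes ^ and $ match at the start and end of each line.
UU_BEGIN = re.compile(r"^begin ([0-7]{3}) (.*)$", re.MULTILINE)


def find_uu_header(text):
    """Return (mode, name) from the first uuencode header line, or None."""
    match = UU_BEGIN.search(text)
    if match is None:
        return None
    # Strip a trailing carriage return in case the file uses CRLF line endings.
    return match.group(1), match.group(2).rstrip("\r")


# Hypothetical sample documents, just to exercise the function.
sample = "<TEXT>\nbegin 642 logo.gif\nM1TE&.#EA\n`\nend\n</TEXT>\n"
print(find_uu_header(sample))
print(find_uu_header("we begin 644 times"))
```

Since this only inspects the header line, it can serve as a cheap filter in front of the full decode, so that only documents with a plausible header are passed to the uu decoder.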

1
On

Two ways:

(1) On Unix-based systems, you can robustly use the file command.

http://unixhelp.ed.ac.uk/CGI/man-cgi?file

$ file foo
foo: uuencoded or xxencoded text

(2) I also found the following (untested) Python code that looks like it will do what you want (at http://ubuntuforums.org/archive/index.php/t-1304548.html).

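#!/usr/bin/env python
# Uses the "magic" Python bindings that ship with libmagic (the file command).
import magic
import sys

filename = sys.argv[1]
ms = magic.open(magic.MAGIC_NONE)
ms.load()  # load the default magic database
ftype = ms.file(filename)  # same description that the file command prints
print(ftype)
ms.close()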