How to reliably tell the type of an uploaded file (text or binary)?


I have an application where users should be able to upload a wide variety of files, but for each file I need to know whether I can safely display its textual representation as plain text.

Using python-magic like

m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())

gives me the correct MIME type.

But sometimes, the MIME type for scripts is application/*, so simply looking for m.startswith('text/') is not enough.

Another site suggested using

m = Magic().from_buffer(cgi.FieldStorage.file.read())

and checking for 'text' in m.
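
Put together, the two candidate checks look roughly like this (a minimal sketch; classify and data are names I made up for illustration, and the example values in the comments depend on your libmagic version):

from magic import Magic  # python-magic

def classify(data):
    # data: raw bytes of the upload, e.g. the result of .file.read() on
    # the cgi.FieldStorage entry for the file field
    mime = Magic(mime=True).from_buffer(data)   # e.g. 'text/plain', 'application/x-sh'
    desc = Magic().from_buffer(data)            # e.g. 'ASCII text', 'POSIX shell script, ...'
    return mime.startswith('text/'), 'text' in desc   # approach 1, approach 2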

Would the second approach be reliable enough for arbitrary file uploads, or could someone suggest another approach?

Thanks a lot.


2 Answers

Answer 1

What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?

The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.
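
To illustrate (a sketch; the exact strings returned depend on the installed libmagic version):

from magic import Magic

m = Magic(mime=True)
m.from_buffer(b"#!/bin/sh\necho hello\n")   # typically 'text/x-shellscript'
m.from_buffer(b"echo hello\n")              # typically 'text/plain' -- the script type is lost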

This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think about what information you can get, what it means, and how you want to treat it.

The secure solution would be to create a whitelist of MIME types that you accept and check against it:

allowed_mime_types = [ ... ]   # e.g. 'text/plain', 'text/x-python'
if m in allowed_mime_types:
    # accept the upload; everything else is rejected
    ...

That means only perfect matches are accepted. It also means that your server will reject valid files that don't report the expected MIME type for some reason (missing shebang header, libmagic failing to recognize the file, a type you forgot to add to your list).

Or to put it another way: Why do you check the mime type of the file if you don't really care?

[EDIT] When you say

I need to know for each file, if I can safely display its textual representation as plain text.

then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).
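
As a rough illustration of such a heuristic (not part of the original answer, just a sketch):

def guess_encoding(data):
    # Prefer UTF-8 if the bytes are valid UTF-8; otherwise fall back to a
    # Latin encoding.  Single-byte encodings such as ISO 8859-1 and 8859-15
    # accept every byte sequence, so they cannot be told apart this way --
    # which is exactly the hairy part described above.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'iso-8859-1'   # a guess; it could just as well be 8859-15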

To fix this, you will either need to force your users to save their text files in a specific encoding (UTF-8 is currently the best choice), or you need to supply a form into which users paste the text.

When using a form, the user can see whether the text is encoded correctly (it is rendered on their screen), they can fix any problems, and you can make sure that the browser sends you the text encoded as UTF-8.

If you can't do that, your only remaining option is to reject any input that contains bytes below 0x20, with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
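
A sketch of that check (Python 3, where iterating over bytes yields integers):

ALLOWED_CONTROL_BYTES = {0x09, 0x0A, 0x0D}   # \t, \n, \r

def looks_like_text(data):
    # True if the data contains no control bytes below 0x20
    # other than tab, newline and carriage return.
    return all(b >= 0x20 or b in ALLOWED_CONTROL_BYTES for b in data)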

But as soon as users use umlauts or other non-ASCII characters (say, in an application that is used worldwide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't, since you don't trust the user).

[EDIT2] Since you need this to check actual source code: if you want to make sure the source code is "safe", then parse it. Most languages let you parse code without actually executing it. That gives you real information (because the parser knows what to look for) and you don't have to make wild guesses :-)
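
For Python source, for instance, the standard library's ast module can parse code without running it (a sketch; other languages need their own parsers):

import ast

def is_valid_python(source):
    # Parse, but never execute, the submitted code.
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):   # ValueError e.g. for null bytes
        return False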

Answer 2

After playing around a bit, I discovered that I can probably use the Magic(mime_encoding=True) results!

I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.

But it does seem pretty usable to simply look for 'binary' in the reported encoding.
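
In code, that check boils down to something like this (a sketch; the exact encoding strings libmagic reports, such as 'us-ascii' or 'utf-8' for text, vary between versions):

from magic import Magic

def is_displayable_text(data):
    # libmagic reports 'binary' as the encoding for data it does not
    # recognise as text.
    encoding = Magic(mime_encoding=True).from_buffer(data)
    return 'binary' not in encoding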

I think I will hang on to that, but thank you all.