Why is the content type of a .gz archive identified incorrectly?

I'm working on an API that handles uploaded images. Images can be either .jpg files or .gz archives.

import requests

url = 'http://example.com/upload'
file_path = 'path/to/my/file.gz'
# open in binary mode and close the file once the request is done
with open(file_path, 'rb') as fh:
    response = requests.post(url, files={'file': fh})

How can I correctly identify if the file is a jpg or a gz archive?

def post(self, request):
    for _, file_data in request.FILES.items():
        print(file_data.content_type)
        if file_data.content_type == 'application/gzip':
            pass  # do something
        elif file_data.content_type.startswith('image/'):
            pass  # do something

The problem is that this prints 'application/octet-stream', and I don't understand why.

willeM_ Van Onsem

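The .content_type [Django-doc] attribute is not a MIME type that Django or Python determines by inspecting the file; it is whatever the client (a browser, or any other HTTP client) sent in the Content-Type header of the upload. If the client does not know the type, does not care, or forges the value, it can differ from the real type, as the documentation specifies: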

The content-type header uploaded with the file (e.g. text/plain or application/pdf). Like any data supplied by the user, you shouldn't trust that the uploaded file is actually this type. You'll still need to validate that the file contains the content that the content-type header claims - "trust but verify."
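Because the value comes from the sender, the sender can also set it explicitly. The following is a minimal sketch, assuming the upload is sent with requests as in the question: the files value is given as a (filename, file object, content type) tuple. The in-memory payload and the has_gzip_header check are only there for illustration; the request is prepared but never sent.

```python
import io

import requests

# Hypothetical in-memory payload standing in for path/to/my/file.gz
payload = io.BytesIO(b"\x1f\x8b\x08\x00")

# (filename, file object, content type): the third item becomes the
# Content-Type header of this part of the multipart body
files = {"file": ("file.gz", payload, "application/gzip")}

# Prepare the request without sending it, so the encoded body can be inspected
prepared = requests.Request(
    "POST", "http://example.com/upload", files=files
).prepare()

# True when the part carries the content type chosen above
has_gzip_header = b"Content-Type: application/gzip" in prepared.body
```

The server still should not trust this header; it only makes a well-behaved client report a meaningful value.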

We can try to guess the MIME type from the content of the file with python-magic [pypi.org]:

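import magic

mime = magic.Magic(mime=True)
# from_descriptor() expects an integer file descriptor, not a file object,
# so sniff the first bytes with from_buffer() instead
result = mime.from_buffer(file_data.read(2048))
file_data.seek(0)  # rewind so the upload can still be read afterwards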

Beware that this reads from the uploaded file, so for a streamed upload it can consume part of the stream, which could prevent you from using the file after determining its type unless the file is rewound first.
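If only these two formats matter, the leading "magic bytes" can also be checked directly, without an extra dependency. This is a rough sketch, not a replacement for python-magic: the helper name sniff_upload_type is hypothetical, and it assumes the file object is seekable so that its position can be restored after peeking.

```python
import io

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of a gzip stream
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG start-of-image marker plus next marker byte


def sniff_upload_type(fileobj):
    """Return 'gzip', 'jpeg' or None, judging only by the leading bytes."""
    start = fileobj.tell()
    head = fileobj.read(3)
    fileobj.seek(start)  # restore the position so later reads see the whole file
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    return None


# Usage with in-memory stand-ins for uploaded files
gz_kind = sniff_upload_type(io.BytesIO(b"\x1f\x8b\x08\x00rest"))
jpg_kind = sniff_upload_type(io.BytesIO(b"\xff\xd8\xff\xe0rest"))
other_kind = sniff_upload_type(io.BytesIO(b"plain text"))
```

In a view like the one in the question, file_data could be passed to the helper in place of the comparison against file_data.content_type.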