Decompressing (Gzip) chunks of response from http.client call


I have the following code that I am using to try to read, in chunks, the response from an http.client.HTTPSConnection GET request to an API (please note that the response is gzip encoded):

    connection = http.client.HTTPSConnection(api, context = ssl._create_unverified_context())
    connection.request('GET', api_url, headers = auth)
    response = connection.getresponse()
    while chunk := response.read(20):
        data = gzip.decompress(chunk)
        data = json.loads(data)
        print(data)

This always fails with an error saying it is not a gzipped file (b'\xe5\x9d'). I'm not sure how I can chunk the data and still achieve what I am trying to do here. Basically, I am chunking so that I don't have to load the entire response into memory. Please note I can't use any other libraries like requests, urllib, etc.


There are 2 best solutions below

BEST ANSWER

The most probable reason is that the response you received is indeed not a gzipped file.

I notice that in your code, you pass a variable called auth. Typically, a server won't send you a compressed response if you don't specify in the request headers that you can accept one. If there are only auth-related keys in your headers, as your variable name suggests, you won't receive a gzipped response. First, make sure you have 'Accept-Encoding': 'gzip' in your headers.
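As a quick sanity check (a sketch, not part of the original answer), real gzip data always begins with the magic bytes 1f 8b, so you can inspect the first two bytes of the raw body before trying to decompress; the b'\xe5\x9d' in the error message shows the payload was something else:

```python
import gzip

blob = gzip.compress(b'{"ok": true}')

# Every gzip stream starts with the magic bytes 1f 8b.
print(blob[:2])                  # b'\x1f\x8b'
print(blob[:2] == b"\x1f\x8b")   # True
```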

Going forward, you will face another problem:

Basically, I am chunking so that I don't have to load the entire response in memory.

gzip.decompress expects a complete file, so you would need to reconstruct it and load it entirely into memory before decompressing, which would defeat the whole point of chunking the response. Trying to decompress part of a gzip stream with gzip.decompress will most likely give you an EOFError saying something like Compressed file ended before the end-of-stream marker was reached.
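This is easy to reproduce in isolation (a minimal sketch): truncating a valid gzip blob past its 10-byte header and handing it to gzip.decompress raises exactly that EOFError:

```python
import gzip

blob = gzip.compress(b"The quick brown fox jumps over the lazy dog.")

try:
    # Keep the gzip header but cut the stream short.
    gzip.decompress(blob[: len(blob) // 2])
except EOFError as exc:
    print(exc)  # Compressed file ended before the end-of-stream marker was reached
```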

I don't know if you can manage that directly with the gzip library, but I know how to do it with zlib: a decompression object created with zlib.decompressobj(16 + zlib.MAX_WBITS) accepts gzip data and can be fed chunk by chunk. I see you have very strong constraints on libraries, but zlib is part of the Python standard library, so hopefully you have it available. Here is a rework of your code that should help you going on:

import http.client
import ssl
import zlib

# your variables here
api = 'your_api_host'
api_url = 'your_api_endpoint'
auth = {'AuthKeys': 'auth_values'}

# add the gzip header
auth['Accept-Encoding'] = 'gzip'

# prepare the decompressing object; 16 + MAX_WBITS tells zlib to expect a gzip header
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

connection = http.client.HTTPSConnection(api, context = ssl._create_unverified_context())
connection.request('GET', api_url, headers = auth)
response = connection.getresponse()

while chunk := response.read(20):
    data = decompressor.decompress(chunk)
    print(data)
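To see the pattern work end-to-end without a live API, the same decompressor can be fed an in-memory gzip blob in 20-byte chunks (a self-contained sketch; with a real response you would also call decompressor.flush() after the loop to drain any buffered tail):

```python
import gzip
import zlib

original = b'{"message": "hello"}' * 100
blob = gzip.compress(original)

# 16 + MAX_WBITS tells zlib to expect a gzip wrapper around the deflate data.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

out = b""
for start in range(0, len(blob), 20):      # 20-byte chunks, like the question
    out += decompressor.decompress(blob[start:start + 20])
out += decompressor.flush()                 # drain anything still buffered

print(out == original)  # True
```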

The problem is that gzip.decompress expects a complete file; you can't just feed it a chunk, because the deflate algorithm relies on previous data during decompression. The whole point of the algorithm is that it is able to repeat something it has already seen before, therefore all the data is required.

However, deflate only cares about the last 32 KiB or so, so it is possible to stream-decompress such a file without needing much memory. This is not something you need to implement yourself, though: Python provides the gzip.GzipFile class, which can wrap the file handle and behaves like a normal file:

import io
import gzip

# Create a file for testing.
# In your case you can just use the response object you get.
file_uncompressed = ""
for line_index in range(10000):
    file_uncompressed += f"This is line {line_index}.\n"
file_compressed = gzip.compress(file_uncompressed.encode())
file_handle = io.BytesIO(file_compressed)

# This library does all the heavy lifting.
gzip_file = gzip.GzipFile(fileobj=file_handle)

while chunk := gzip_file.read(1024):
    print(chunk)
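Since the object returned by connection.getresponse() is itself file-like, the same wrapping should work on it directly (a sketch using io.BytesIO as a stand-in for the response; the sample JSON payload is made up):

```python
import gzip
import io
import json

# Stand-in for connection.getresponse(); a real HTTPResponse is also file-like.
response = io.BytesIO(gzip.compress(json.dumps({"items": [1, 2, 3]}).encode()))

gzip_file = gzip.GzipFile(fileobj=response)

# Read decompressed data in small pieces; join only at the end,
# since json.loads needs the complete document.
parts = []
while chunk := gzip_file.read(1024):
    parts.append(chunk)

data = json.loads(b"".join(parts))
print(data["items"])  # [1, 2, 3]
```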